Going in circles without a real-time clock
10 April 2024 | 8:45 pm

I have a story about paper cuts when using a little Linux box.

One of my sites has an older Raspberry Pi installed in a spot that takes some effort to access. A couple of weeks ago, it freaked out and stopped allowing remote logins. My own simple management stuff was still running and was reporting that something was wrong, but it wasn't nearly enough detail to find out exactly what happened.

I had to get a console connected to it in order to find out that it was freaking out about its filesystem because something stupid had apparently happened to the SD card. I don't know exactly why it wouldn't let me log in. Back in the old days, you could still get into a machine with a totally dead disk as long as enough stuff was still in the cache - inetd + telnetd + login + your shell, or sshd + your shell and (naturally) all of the libraries those things rely on. I guess something happened and some part of the equation was missing. There are a LOT more moving parts these days, as we've been learning with the whole xz thing. Whatever.

So I rebooted it, and went about my business, and it wasn't until a while later that I noticed the thing's clock was over a day off. chrony was running, so WTF, right? chrony actually said that it had no sources, so it was just sitting there looking sad.

This made little sense to me, given that chrony is one of the more clueful programs which will keep trying to resolve sources until it gets enough to feel happy about using them for synchronization. In the case of my stock install, that meant it was trying to use 2.debian.pool.ntp.org.

I tried to resolve it myself on the box. It didn't work. I queried another resolver (on another box) and it worked fine. So now what, on top of chrony not working, unbound wasn't working either?

A little context here: this box was reconfigured at some point to run its own recursive caching resolver for the local network due to some other (*cough* TP-Link *cough*) problems I had last year. It was also configured to *only* use that local unbound for DNS resolution.

This started connecting some of the dots. chrony wasn't setting the clock because it couldn't resolve hosts in the NTP pool. It couldn't resolve hosts because unbound wasn't working. But, okay, why wasn't unbound working?

Well, here's the problem - it *mostly* was. I could resolve several other domains just fine. It's just that ntp.org stuff wasn't happening.

(This is where you start pointing at the screen if this has happened to you before.)

So, what would make only some domains not resolve... but not all of them... on a box... with a clock that's over a day behind?

Yeah, that's about when it fit together. I figured they must be running DNSSEC on that zone (or some part of it), and it must have a "not-before" constraint on some aspect of it. I've been down this road before with SSH certificates, so why not DNS?
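
To make that hunch concrete: DNSSEC signatures (RRSIG records) carry an inception time and an expiration time, and a validating resolver rejects a signature when its own clock falls outside that window. Here's a minimal sketch of that kind of check. It is not unbound's actual code, and the function name and the dates are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def signature_usable(now, inception, expiration):
    """Return True if `now` falls inside the signature's validity window."""
    return inception <= now <= expiration


# Hypothetical window: signed on 8 April, good for two weeks.
inception = datetime(2024, 4, 8, tzinfo=timezone.utc)
expiration = inception + timedelta(days=14)

real_time = datetime(2024, 4, 9, 12, 0, tzinfo=timezone.utc)
stale_clock = real_time - timedelta(days=2)  # a box that is two days behind

print(signature_usable(real_time, inception, expiration))
print(signature_usable(stale_clock, inception, expiration))
```

A box with the right time accepts the signature. The same box with a clock two days behind sees a signature "from the future" and refuses it, which would look exactly like one zone failing while the rest resolve.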

I added another resolver to resolv.conf, then chrony started working, and that brought the time forward, and then unbound started resolving the pool, and everything else returned to normal.

By "everything else", I also mean WireGuard. Did you know that if your machine gets far enough out of sync, that'll stop working, too? I had no idea that it apparently includes time in its crypto stuff, but what other explanation is there?

Backing up, let's talk about what happened, because most of this is on me.

I have an old Pi running from an SD card. It freaked out. It took me about a day and a half to get to where it was so I could start working on fixing it.

This particular Pi doesn't have a real-time clock. The very newest ones (5B) *do*, but you have to actually buy a battery and connect it. By default, they are in the same boat. This means when they come up, they use some nonsense time for a while. I'm not sure exactly what that is offhand, because...

systemd does something of late where it will try to put the clock back to somewhere closer to "now" when it detects a value that's too far in the past. I suspect it just digs around in the journal, grabs the last timestamp from that, and runs with it. This is usually pretty good, since if you're just doing a commanded reboot, the difference is a few seconds, and your time sync stuff fixes the rest not long thereafter.

But, recall that the machine sat there unable to write to its "disk" (SD card) for well over a day, so that's the timestamp it used. If I had gotten there sooner, I guess it wouldn't have been so far off, but that wasn't an option.

Coming up with time that far off made unbound unable to resolve the ntp.org pool servers, and that made chrony unable to update the clock... which made unbound unable to resolve the pool servers... which...

My own configuration choice which pointed DNS resolution only at localhost did the rest.

So, what now? Well, first of all, I gave it secondary and tertiary resolvers so that particular DNS anomaly won't be repeated. Then I explicitly gave chrony a "peer" source of a nearby host (another Pi, unfortunately) which might be able to help it out in a pinch even if the link to the outside isn't up for whatever reason.

There's a certain problem with thinking of these little boxes as cheap. They are... until they aren't. To mangle a line from jwz, a Raspberry Pi is only cheap if your time has no value.

As usual, this post is not a request for THE ONE to show up. If you are THE ONE, you don't make mistakes. We know. Shut up and go away.

autoconf makes me think we stopped evolving too soon
3 April 2024 | 12:31 am

I've gotten a few bits of feedback asking for my thoughts and/or reactions to the whole "xz backdoor" thing that happened over the past couple of days. Most of my thoughts on the matter apply to autoconf and friends, and they aren't great.

I don't have to cross paths with those tools too often these days, but there was a point quite a while back when I was constantly building things from source, and a ./configure --with-this --with-that was a given. It was a small joy when the thing let me reuse the old configure invocation so I didn't have to dig up the specifics again.

I got that the whole reason for autoconf's derpy little "recipes" is that you want to know if the system you're on supports X, or can do Y, or exactly what flavor of Z it has, so you can #ifdef around it or whatever. It's not quite as relevant today, but sure, there was once a time when a great many Unix systems existed and they all had their own ways of handling stuff, and no two were the same.

So, okay, fine, at some point it made sense to run programs to empirically determine what was supported on a given system. What I don't understand is why we kept running those stupid little shell snippets and little bits of C code over and over. It's like, okay, we established that this particular system does <library function foobar> with two args, not three. So why the hell are we constantly testing for it over and over?

Why didn't we end up with a situation where it was just a standard thing that had a small number of possible values, and it would just be set for you somewhere? Whoever was responsible for building your system (OS company, distribution packagers, whatever) could leave something in /etc that says "X = flavor 1, Y = flavor 2" and so on down the line.

And, okay, fine, I get that there would have been all kinds of "real OS companies" that wouldn't have wanted to stoop to the level of the dirty free software hippies. Whatever. Those same hippies could have run the tests ONCE per platform/OS combo, put the results into /etc themselves, and then been done with it.

Then instead of testing all of that shit every time we built something from source, we'd just drag in the pre-existing results and go from there. It's not like the results were going to change on us. They were a reflection of the way the kernel, C libraries, APIs and userspace happened to work. Short of that changing, the results wouldn't change either.
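
As a rough sketch of what "drag in the pre-existing results" could look like, assume a plain "KEY = value" file left behind by whoever built the system. The file name, the format and the keys below are all hypothetical; nothing like this is standardized, which is the whole complaint.

```python
def parse_facts(text):
    """Parse 'KEY = value' lines, skipping blank lines and # comments."""
    facts = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a KEY = value line: {raw!r}")
        facts[key.strip()] = value.strip()
    return facts


# Hypothetical contents of something like /etc/platform-facts.
SAMPLE = """
# written once by whoever built this OS image
FOOBAR_ARGS = 2
HAVE_PTHREAD = yes
"""

facts = parse_facts(SAMPLE)
print(facts["FOOBAR_ARGS"])
```

A build would read this once and skip the probing entirely. The answers only need regenerating when the kernel, C library or userspace actually changes.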

But no, we never got to that point, so it's still normal to ship a .tar.gz with an absolute crap-ton of dumb little macro files that run all kinds of inscrutable tests that give you the same answers that they did the last time they ran on your machine or any other machine like yours, and WILL give the same answers going forward.

That means it's totally normal to ship all kinds of really crazy looking stuff, and so when someone noticed that and decided to use that as their mechanism for extracting some badness from a so-called "test file" that was actually laden with their binary code, is it so surprising that it happened? To me, it seems inevitable.

Incidentally, I want to see what happens if people start taking tarballs from various projects and diffing them against the source code repos for those same projects. Any file that "appears" in the tarball that's allegedly due to auto[re]conf being run on the project had better match something from the actual trees of autoconf, automake, ranlib, gettext, or whatever else goofy meta-build stuff is being used these days.

$ find . -type f | sort | xargs sha1sum
7d963e5f46cd63da3c1216627eeb5a4e74a85cac  ./ax_pthread.m4
c86c8f8a69c07fbec8dd650c6604bf0c9876261f  ./build-to-host.m4
0262f06c4bba101697d4a8cc59ed5b39fbda4928  ./getopt.m4
e1a73a44c8c042581412de4d2e40113407bf4692  ./gettext.m4
090a271a0726eab8d4141ca9eb80d08e86f6c27e  ./host-cpu-c-abi.m4
961411a817303a23b45e0afe5c61f13d4066edea  ./iconv.m4
46e66c1ed3ea982b8d8b8f088781306d14a4aa9d  ./intlmacosx.m4
ad7a6ffb9fa122d0c466d62d590d83bc9f0a6bea  ./lib-ld.m4
7048b7073e98e66e9f82bb588f5d1531f98cd75b  ./lib-link.m4
980c029c581365327072e68ae63831d8c5447f58  ./lib-prefix.m4
d2445b23aaedc3c788eec6037ed5d12bd0619571  ./libtool.m4
421180f15285f3375d6e716bff269af9b8df5c21  ./lt~obsolete.m4
f98bd869d78cc476feee98f91ed334b315032c38  ./ltoptions.m4
530ed09615ee6c7127c0c415e9a0356202dc443e  ./ltsugar.m4
230553a18689fd6b04c39619ae33a7fc23615792  ./ltversion.m4
240f5024dc8158794250cda829c1e80810282200  ./nls.m4
f40e88d124865c81f29f4bcf780512718ef2fcbf  ./po.m4
f157f4f39b64393516e0d5fa7df8671dfbe8c8f2  ./posix-shell.m4
4965f463ea6a379098d14a4d7494301ef454eb21  ./progtest.m4
15610e17ef412131fcff827cf627cf71b5abdb7e  ./tuklib_common.m4
166d134feee1d259c15c0f921708e7f7555f9535  ./tuklib_cpucores.m4
e706675f6049401f29fb322fab61dfae137a2a35  ./tuklib_integer.m4
41f3f1e1543f40f5647336b0feb9d42a451a11ea  ./tuklib_mbstr.m4
b34137205bc9e03f3d5c78ae65ac73e99407196b  ./tuklib_physmem.m4
f1088f0b47e1ec7d6197d21a9557447c8eb47eb9  ./tuklib_progname.m4
86644b5a38de20fb43cc616874daada6e5d6b5bb  ./visibility.m4

... there's no build-to-host.m4 with that sha1sum out there, *except* for the bad one in the xz release. That part was caught... but what about every other auto* blob in every other project out there? Who or what is checking those?
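
The first half of that check, finding files in a release tarball that the repo can't explain, might look something like the sketch below. It assumes the tarball has already been unpacked and the repo checked out at the matching tag; the function names and the tiny demo trees are made up for illustration.

```python
import hashlib
import os
import tempfile


def hash_tree(root):
    """Map each file's path (relative to root) to the SHA-1 of its contents."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                digest = hashlib.sha1(handle.read()).hexdigest()
            digests[os.path.relpath(path, root)] = digest
    return digests


def unexplained_files(tarball_root, repo_root):
    """Return tarball files that are missing from, or differ from, the repo."""
    tarball = hash_tree(tarball_root)
    repo = hash_tree(repo_root)
    return sorted(p for p, d in tarball.items() if repo.get(p) != d)


def write(root, relpath, data):
    """Helper for the demo: create a file (and its directories) under root."""
    path = os.path.join(root, relpath)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as handle:
        handle.write(data)


# Tiny made-up trees standing in for an unpacked tarball and a git checkout.
with tempfile.TemporaryDirectory() as tarball_dir, tempfile.TemporaryDirectory() as repo_dir:
    write(repo_dir, "src/main.c", b"int main(void) { return 0; }\n")
    write(tarball_dir, "src/main.c", b"int main(void) { return 0; }\n")
    write(tarball_dir, "m4/build-to-host.m4", b"# not in the repo\n")
    suspects = unexplained_files(tarball_dir, repo_dir)
    print(suspects)
```

Everything it reports then needs the second pass: does that file's hash match anything in the upstream autoconf, automake or gettext trees? If not, somebody should be looking at it.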

And finally, yes, I'm definitely biased. My own personal build system has a little file that gets installed on a machine based on how the libs and whatnot work on it. That means all of the Macs of a particular version of the OS get the same file. All of the Debian boxes running the same version get the same file, and so on down the line.

I don't keep asking the same questions every time I go to build stuff. That's just madness.

Port-scanning the fleet and trying to put out fires
27 March 2024 | 5:32 am

There was this team which was running a pretty complicated data storage, leader election and "discovery" service. They had something like 3200 machines and had something like 300 different clusters/cells/ensembles/...(*) running across them. This service ran something kind of like etcd, only not that.

The way it worked was that a bunch of "participant" machines would start an election process, and then they'd decide who was going to lead them for a while. That leader got to handle all of the write traffic and it did all of the usual raft/paxos-ish spooky coordination stuff amongst the participants, including updating the others, and dealing with hosts that go away and come back later, and so on. It's all table stakes for this kind of service.

This group of clusters had started out relatively simple but had grown into a monster over the years. Probably nobody expected them to have hundreds of clusters and thousands of machines, but they now did, and they were having trouble keeping track of everything. There were constant outages, and since they were so low in the stack, when they broke, lots of other stuff broke.

I wanted to know just what the ground truth looked like and so started something really stupid from my development machine. It would take a list of their servers and would crawl them, interrogating the TCP ports on which the service ran. This was only about 10 ports per machine, so while it sounded obnoxiously high, it was still possible for prototyping purposes.

On these ports, there were simple text-based commands which could be sent, and it would return config information about what that particular instance was running. It was possible to derive the identity of the cluster from that. Given all of this and a scrape of the entire fleet, it was possible to see which host+port combinations were actually supporting any given cluster, and thus see how well they were doing.

Early results from these terrible manual scrapes started showing promise. Misconfigurations were showing up all over the place - clusters that are supposed to have 5 hosts but only have 3 in practice with the other two missing in action somewhere, clusters with non-standard host counts, clusters in the wrong spots, and so on.
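
The "supposed to have 5 but only has 3" check is simple once the scrape exists. Here's a minimal sketch, assuming each scrape result boils down to a (host, port, cluster id) tuple and that the expected sizes are known from somewhere. The data shapes, names and port numbers are invented for illustration, not what the real script used.

```python
from collections import defaultdict


def members_by_cluster(scrape):
    """Group (host, port, cluster_id) scrape results by cluster."""
    clusters = defaultdict(set)
    for host, port, cluster_id in scrape:
        clusters[cluster_id].add((host, port))
    return clusters


def wrong_sized(scrape, expected_sizes):
    """Return {cluster_id: (expected, found)} for clusters with the wrong member count."""
    found = members_by_cluster(scrape)
    problems = {}
    for cluster_id, expected in expected_sizes.items():
        actual = len(found.get(cluster_id, ()))
        if actual != expected:
            problems[cluster_id] = (expected, actual)
    return problems


# Made-up scrape: cluster "alpha" should have 5 members but only 3 answered.
scrape = [
    ("host1", 7001, "alpha"), ("host2", 7001, "alpha"), ("host3", 7001, "alpha"),
    ("host1", 7002, "beta"), ("host4", 7002, "beta"), ("host5", 7002, "beta"),
]
print(wrong_sized(scrape, {"alpha": 5, "beta": 3}))
```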

To get away from the "printf | nc the world in cron" thing, we wound up writing this dumb little agent thing that would run on all of the ~3200 hosts. It would do the same crawling, but it would happen over loopback so it was a good bit faster by removing long hauls over the production network from the equation. It also took the load of polling ~32000 ports off my singular machine, and was inherently parallel.

It was now possible to just query an agent and get a list of everything running on that box. It would refresh things every minute, so it was far more current than my terrible script which might run every couple of hours (since it was so slow). This made things even better, and so we needed an aggregator.

We did some magic to make each of these agents create a little "beacon" somewhere any time they were run. Our "aggregator" process would start up and would subscribe to the spot where beacons were being created. It would then schedule the associated host for checks, where it would speak to the agent on that host and ask for a copy of its results.

So now we had an agent on every one of the ~3200 hosts, each polling 10 local ports, plus an aggregator that talked to the ~3200 agents and refreshed the data from them.

Finally, all of the data was available in one place with a single query that was really fast. The next step was to write a bunch of simple "dashboard" web pages which allowed anyone to look at the entire fleet, or to narrow it down by certain parameters - a given cluster (of these servers), a given region, data center, whatever.

With all of this visible with just a few clicks, it was pretty clear that we needed something more to actually find the badness for us. It was all well and good to go clicking around while knowing what things are supposed to look like, but there were supposed to be rules about this sort of thing: this many hosts in a cluster, no more than N hosts per failure domain, and more.


Failure domains are a funny thing. Let's say you have five hosts which form a quorum and which are supposed to be high-availability. You'd probably want to spread them around, right? If they were serving clients from around the world, maybe you'd put them in different locations and never put two in the same spot? If something violated that, how would you know?

Here's an example of bad placement. We had this one cluster which was supposed to be spread out throughout an entire region which was composed of multiple datacenter buildings, each with multiple (compute) clusters in it, with different racks and so on down the line. But, because it had been turned up early in the life of that region when only a handful of hosts had existed, all of them were in the same two or three racks.

Worse still, those racks were physically adjacent. Put another way, if the servers had arms and hands, they could have high-fived each other across the hot and cold aisles in the datacenter suite. That's how close together they were. One bad event in a certain spot would have wiped out all of their data.

We had to write a schema which would let us express limits for a given cluster - how many regions it should be in, the maximum number of members per host, rack, (compute) cluster, building, region, etc. Then we wrote a tool to let us create rules, and then started using that to churn out rulesets. Next we came up with some tools which would fetch the current state of affairs (from the agent/aggr combo) and compare it to the rulesets. Anything out of "compliance" would show up right away.
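
A stripped-down version of that idea, as a sketch: a ruleset holds the expected member count plus a per-failure-domain limit, and a checker counts how many observed members share each rack, building and so on. The class, the field names and the sample data are all hypothetical; the real schema had more in it.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Ruleset:
    """Hypothetical limits for one quorum cluster."""
    member_count: int
    max_per_domain: dict = field(default_factory=dict)  # e.g. {"rack": 1}


def violations(ruleset, members):
    """Compare observed members (dicts of domain -> value) against a ruleset."""
    found = []
    if len(members) != ruleset.member_count:
        found.append(f"expected {ruleset.member_count} members, found {len(members)}")
    for domain, limit in ruleset.max_per_domain.items():
        counts = Counter(member[domain] for member in members)
        for value, count in sorted(counts.items()):
            if count > limit:
                found.append(f"{count} members in {domain} {value}, limit is {limit}")
    return found


rules = Ruleset(member_count=5, max_per_domain={"rack": 1, "building": 2})
members = [
    {"host": "h1", "rack": "r1", "building": "b1"},
    {"host": "h2", "rack": "r1", "building": "b1"},
    {"host": "h3", "rack": "r2", "building": "b1"},
    {"host": "h4", "rack": "r3", "building": "b2"},
    {"host": "h5", "rack": "r4", "building": "b3"},
]
for problem in violations(rules, members):
    print(problem)
```

An empty list means the cluster is in "compliance". Anything else is a line a human (or a dashboard) can read directly.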


Then there was the problem of managing the actual ~3200 hosts. With a footprint that big, there's always something happening. A location gets turned up and new hosts appear. Another location is taken down after the machines get too old and those hosts go away. We kept having outages where a decom would be scheduled, and then someone far away would run a script with a bunch of --force type commands, and it would just yank the machines and wipe them. It had no regard for what they were actually doing, and they managed to clobber a bunch of stuff this way. It just kept happening.

This is when I had to do something that does not scale. I said to the decom crew that they should treat any host owned by this team as off limits because we do not have things under control. That means never *ever* running a decom script against these hosts while they are still owned by the team.

I further added that while we're working to get things under control, if for some reason a decom is blocked due to this decree of mine, they are to contact me, any time of day or night, and I will get them unblocked... somehow. I figured it was my way of showing that I had "skin in the game" for making such a stupid and unreasonable demand.

I've often said that the way to get something fixed is to make sure someone is in the path of the badness so they will feel it when something screws up. This was my way of doing exactly that.

We stopped having decom-related outages. We instead started having these "fire drill" type events where one or two people on the team (and me) would have to drop what they were doing and spend a few hours manually replacing machines in various clusters to free them up.

Obviously, this couldn't stand, and so we started in on another project. This one was more of a "fleet manager", where a dumb little service would keep track of which machines the team owned, and it would store a series of bits for each one that I called "intents".

There were only three bits per host: drain, release, freeze. Not all combinations were valid.

If no bits were set on a host, that meant it was intended for production use. If it has a server on it, that's fine. If someone needs a replacement, it's potentially available (assuming it meets the other requirements, like being far enough away from the other participants).

If the "drain" bit was set, that meant it was not supposed to be serving. Any server on it should be taken off by replacing it with an available host which itself isn't marked for "drain" (or worse).

The "release" bit meant that if a host no longer had anything running on it, then it should be released back to the machine provisioning system. In doing this, the name of the machine changed, and thus the ownership (and responsibility) for it left the team, and it was no longer our problem. The people doing decoms would take it from there.

"Freeze" was a special bit which was intended as a safety mechanism to stop a runaway automation system. If that bit was set on a host, none of the tools would change anything on it. It's one of those things where you should never need to use it, but you'll be sorry if you don't write it and then need it some day.

"Drain" + "release" meant "keep trying to kick instances off this host and don't add any new ones", and then "once it becomes empty, give it back".

Other combinations of the bits (like "release" without "drain") were invalid and were rejected by the automation.
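
The three bits and the one rule that is spelled out above ("release" needs "drain") fit in a few lines. This is a sketch, not the real fleet manager. In particular, it assumes "freeze" can sit on top of any otherwise-valid combination, which the description above doesn't actually settle.

```python
from enum import Flag


class Intent(Flag):
    NONE = 0
    DRAIN = 1
    RELEASE = 2
    FREEZE = 4


def is_valid(intent):
    """Reject RELEASE without DRAIN.

    Assumption: FREEZE may be combined with any otherwise-valid combination.
    """
    if Intent.RELEASE in intent and Intent.DRAIN not in intent:
        return False
    return True


def describe(intent):
    """Turn a combination of bits into a human-readable intent."""
    if not is_valid(intent):
        return "invalid"
    if Intent.FREEZE in intent:
        return "frozen: tools must not touch this host"
    if Intent.RELEASE in intent:
        return "drain it, then give it back once empty"
    if Intent.DRAIN in intent:
        return "drain it"
    return "available for production"


for bits in (Intent.NONE, Intent.DRAIN, Intent.DRAIN | Intent.RELEASE, Intent.RELEASE):
    print(bits, "->", describe(bits))
```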

I should note that this was meant to be level-triggered, meaning on every single pass, if a host had a bit set and yet wasn't matching up with that intent or those intents, something should try to drain it, or give it away, or whatever. Even if it failed, it should try again on the next pass, and failures should be unusual and thus reported to the humans.
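
Level-triggered, in code terms, means the decision for each host is computed from what's true right now, not from a record of what was attempted before. A minimal sketch, with intents as plain strings and invented action names:

```python
def next_action(intents, instance_count):
    """Decide what one pass should attempt for one host.

    `intents` is a set of strings drawn from {"drain", "release", "freeze"}.
    The decision depends only on current state, never on earlier passes.
    """
    if "freeze" in intents:
        return "nothing"
    if "drain" in intents and instance_count > 0:
        return "move an instance elsewhere"
    if "release" in intents and "drain" in intents and instance_count == 0:
        return "give host back"
    return "nothing"


def run_pass(hosts):
    """One level-triggered pass: look at every host, every time."""
    return {name: next_action(h["intents"], h["instances"]) for name, h in hosts.items()}


# Made-up fleet state.
hosts = {
    "h1": {"intents": set(), "instances": 2},
    "h2": {"intents": {"drain"}, "instances": 1},
    "h3": {"intents": {"drain", "release"}, "instances": 0},
    "h4": {"intents": {"drain", "freeze"}, "instances": 3},
}
print(run_pass(hosts))
```

If the drain on h2 fails this time around, nothing special has to happen. The next pass sees the same state and arrives at the same action, so the retry comes for free.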


Then there was also the pre-existing system which took config files and used them to install instances on machines. This system worked just fine, but it only did that part of the process. It didn't close the loop and so many parts of the service lifecycle wound up unmanaged by it.

Looking back at this, you can now see that we could establish a bunch of "sets" with the data available.

Configs: "where we told it to run"

Agent + aggregator: "where it's actually managing to run"

Checker: "what rules these things should be obeying"

Fleet manager: "which machines should be serving (or not), which machines we should hang onto (or give back)".

Doing different operations on those sets yielded different things.

[configs] x [agent/aggr] = hosts which are doing what they are supposed to be doing, hosts which are supposed to be serving but aren't for some reason, and hosts which are NOT supposed to be running the service but are running it anyway. It would find sick machines, failures in the config system, weird hand-installed hack jobs in dark corners, and worse.

[agent/aggr] x [checker] = clusters which are actually spread out correctly, and clusters which are actually spread out incorrectly (possibly because of bad configs, but could be any reason).

[agent/aggr] x [fleet manager] = hosts which are serving where that's okay, hosts which need to be drained until empty, and hosts which are now empty and can be given back.

[configs] x [checker] = are out-of-spec clusters due to the configs telling them to be in the wrong spot, or is something else going on? You don't really need to do this one, since if the first one checks out, then you know that everything is running exactly what it was told to run.

[configs] x [fleet manager] = if you ever get to a point where you completely trust that the configs are being implemented by the machines (because some other set operations are clear), then you could find mismatches this way. You wouldn't necessarily have to resort to the empirical data, and indeed, could stop scanning for it.
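
Reading "x" loosely as "compare these two sets", the first and third of those operations map directly onto plain set arithmetic. The sketch below uses made-up data where each element is a (host, cluster) pair; the variable names are for illustration only.

```python
# Made-up data: each element is a (host, cluster) pair.
configured = {("h1", "alpha"), ("h2", "alpha"), ("h3", "alpha")}
observed = {("h1", "alpha"), ("h2", "alpha"), ("h9", "alpha")}

# [configs] x [agent/aggr]
as_intended = configured & observed   # told to run it, and running it
missing = configured - observed       # told to run it, but not running it
unexpected = observed - configured    # running it without being told to

# [agent/aggr] x [fleet manager]
draining_hosts = {"h2"}               # hosts with the "drain" bit set
serving_hosts = {host for host, _cluster in observed}
still_to_drain = draining_hosts & serving_hosts
ready_to_release = draining_hosts - serving_hosts

print(sorted(missing), sorted(unexpected), sorted(still_to_drain))
```

Here h3 is the sick machine or config failure, h9 is the hand-installed hack job in a dark corner, and h2 still has work to do before it can be given back.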

For that matter, the whole port-scanning agent/aggr combination shouldn't have needed to exist in theory, but in practice, independent verification was needed.

I should point out that my engagement with this team was not viewed kindly by management, and my reports about what had been going on ultimately got me in trouble more than anything else. It's kind of amazing, considering I was working with them as a result of a direct request for reliability help, but shooting the messenger is nothing new. This engagement taught me that a lot of so-called technical problems are in fact rooted in human issues, and those usually come from management.

There's more that happened as part of this whole process, but this post has gotten long enough.


(*) I'm using "clusters" here to primarily refer to the groups of 5, 7, or 9 hosts which participated in a quorum and kept the state of the world in sync. Note that there's also the notion of a "compute cluster", which is just a much larger group of perhaps tens of thousands of machines (all with various owners), and that does show up in this post in a couple of places, and is called out explicitly when it does.
