On yak zones and time yaks
It's been a while since I've done a full yak-induced storytime, hasn't it? Let me tell you a tale.
Once upon a time, last Thursday, during my first on-call week at Etsy, I got paged because Elasticsearch was having an issue with one of its indices. The details of that aren't important right now. What is important is that Avleen discovered that the ES master was in EST, while the rest of the cluster was in UTC, and that was causing Problems. We fixed the index, the issue resolved, and I made a ticket to look into why that server was in EST once it was morning and I was more rested.
Friday morning, I discovered that not one, but two of the three potential ES master nodes were in EST. Laurie saw the ticket I had created and said, "What? This shouldn't be possible!" I agreed, it shouldn't have happened according to what we knew about our provisioning process, but there those two servers were. I did a quick knife ssh to see if there were any other servers in EST that shouldn't be. Shockingly, there were - about 30 in total. And all of them were CentOS 7 machines, which were a somewhat recent addition to our infrastructure (which at this point is mostly CentOS 6).
I checked all of our CentOS 7 machines (about 60 of them) and discovered that only half of them were in EST, while the other half were in UTC like they were supposed to be. This didn't really seem like an improvement to the situation.
Maybe some of the machines had been rebooted recently and something had changed on reboot, we thought. But there was no correlation between uptime and timezone. When were the machines built, we asked? I found that the machines built more recently (in 2015) were correctly in UTC, while the older ones (from November and December 2014) were the ones in EST. Nothing changed in our build process around then, but we did find some odd things there.
Our kickstart specified America/New_York, then a Chef recipe changed it to be UTC later (with a mysterious comment mentioning that it shouldn't be GMT, for reasons related to leap seconds that nobody could remember the specifics of). But that recipe hadn't changed recently either - what was going on with those 30 servers? I knew that the /etc/localtime file was what determined the server's timezone, so I looked at that file across all the CentOS 7 servers. All but two were symlinks to /usr/share/zoneinfo/America/New_York. But half those servers were in UTC, not EST as the New York zonefile would suggest. (The remaining two servers had their /etc/localtime files symlinked to /usr/share/zoneinfo/UTC and were in UTC, in the one sensible thing we discovered all week.) How could servers with the same /etc/localtime file be in different timezones? On a whim, I ran ls -al on the New_York zoneinfo file on all the servers, and guess what I found?
The contents of the America/New_York zonefile on half the servers were actually those of the UTC zonefile! "BUT WHY?" we yelled, "WHO WOULD DO SUCH A THING?" Laurie helpfully pointed out that, on all the CentOS 6 servers, the /etc/localtime file was just a regular file, not a symlink. That wasn't anything we'd changed, but we had a hypothesis. If a file is a symlink, and you change that file, what happens?
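If you'd like to see the answer for yourself, here is a minimal sketch. It assumes a Linux-like system where creating symlinks is allowed, and the file names and contents are hypothetical stand-ins for the real zoneinfo files, not our actual Chef recipe.

```python
import os
import shutil
import tempfile


def demo_write_through_symlink():
    """Copy a file onto a symlink and report what happened to the link's target."""
    with tempfile.TemporaryDirectory() as root:
        # Hypothetical stand-ins for /usr/share/zoneinfo/UTC,
        # /usr/share/zoneinfo/America/New_York and /etc/localtime.
        utc = os.path.join(root, "UTC")
        new_york = os.path.join(root, "New_York")
        localtime = os.path.join(root, "localtime")

        with open(utc, "w") as f:
            f.write("utc-data")
        with open(new_york, "w") as f:
            f.write("new-york-data")

        # CentOS 7 style: localtime is a symlink to the zone file.
        os.symlink(new_york, localtime)

        # Roughly what a "copy UTC over localtime" step does.
        # Opening the destination for writing follows the symlink.
        shutil.copyfile(utc, localtime)

        with open(new_york) as f:
            target_contents = f.read()
        return os.path.islink(localtime), target_contents


still_symlink, new_york_contents = demo_write_through_symlink()
print(still_symlink, new_york_contents)  # prints: True utc-data
```

The symlink itself survives untouched, and the file it points to now holds the UTC stand-in's contents.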
So if the /etc/localtime file was a symlink, and something wrote to /etc/localtime, it would end up writing over what the symlink pointed to (America/New_York). What was writing to that file? That Chef recipe that I mentioned copied the UTC zoneinfo file to /etc/localtime, assuming it was a regular file and not a symlink. When did that change happen though? I traced the timezone command from our kickstart files to timezone.py in anaconda. On the RHEL6 branch:
A file copy. And on RHEL7:
A symlink. I KNEW IT. I KNEW THOSE YAKS WERE IN THERE SOMEWHERE.
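In hindsight, a check along these lines would probably have pointed at the clobbered zonefile much sooner. This is a sketch, not something we actually ran: audit_localtime is a hypothetical helper, and its default paths assume the usual CentOS layout.

```python
import filecmp
import os


def audit_localtime(localtime="/etc/localtime",
                    utc_zonefile="/usr/share/zoneinfo/UTC"):
    """Report where localtime really points and whether that file holds UTC data.

    Hypothetical helper for illustration; the default paths are assumptions.
    """
    resolved = os.path.realpath(localtime)
    return {
        # True on the CentOS 7 boxes, False on the CentOS 6 ones.
        "is_symlink": os.path.islink(localtime),
        # The file that actually supplies the timezone data.
        "resolved_path": resolved,
        # Byte-for-byte comparison, not just size and mtime.
        # Note: aliases such as Etc/UTC legitimately match here too.
        "content_matches_utc": filecmp.cmp(resolved, utc_zonefile, shallow=False),
    }


if __name__ == "__main__":
    # Only run against the real files if they exist on this machine.
    if os.path.exists("/etc/localtime") and os.path.exists("/usr/share/zoneinfo/UTC"):
        print(audit_localtime())
```

A host whose resolved path ends in America/New_York while the content matches UTC is exactly the odd state we kept finding.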
But there's still one last mystery. Why did only half of these servers - the older ones - end up in EST? That Chef recipe was running on all the CentOS 7 servers, not just the older ones. It turns out our good friend tzdata is to thank for that. The old servers, being old, had gotten an update to the tzdata package, which restored the New_York file to its correct contents, while the newer servers hadn't gotten an update and so still had our hacky UTC file content. We can verify this by using the handy -l option to rpm to find out what files the package contains.
This, of course, hadn't affected the CentOS 6 boxes because, again, no symlinks, so America/New_York could be updated without messing with anything else.
In the end, we decided to fix this by doing the sensible thing: having the kickstart file just specify UTC in the first place and getting rid of the Chef "fix" entirely. Huzzah. The moral of the story: Timezones are hard, symlinks are symlinky, and yaks are exceptionally hairy this time of year.