On DNS explosions
Once upon a time, several months ago, I got a very worrying text message from one of my coworkers. "Hey, can you hop online? All of our DNS entries are disappearing."
As it turns out, not all of our DNS entries were disappearing. Just the ones for our 8 core production Mongo servers. You know, kind of important servers, in the grand scheme of things. Looking in Route53, we saw that the DNS entries had been overwritten so they pointed to an empty string instead of the public IPs of the EC2 instances. We manually added the entries back. And then they disappeared again. Turns out every chef-client run caused this to happen, so we disabled chef-client on these boxes, added the DNS entries again, and went back to bed.
Later investigation revealed that these 8 boxes were our only hi1.4xlarge instances, and the fact that chef-client was causing this made me suspect ohai. Running `ohai ec2` returned {} instead of the normal JSON full of metadata about the instance. Metadata that included, for example, the public hostname of the instance. Stepping through the ohai code, I realized that the AWS metadata service was returning a 404 for something called `metrics/vhostmd`, and that made ohai just crap out and return nothing at all, instead of returning nothing only for the one endpoint that 404'd.
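Ohai's real EC2 plugin isn't reproduced here, but the shape of the problem is easy to sketch in plain Ruby. The little crawler below (names and structure are mine, and it assumes the old token-less IMDSv1 metadata endpoint) walks the metadata tree and treats a 404 on an individual key like `metrics/vhostmd` as "skip that key" rather than "return {} for everything", which is the behavior we were wishing for:

```ruby
require 'net/http'
require 'json'

# Plain-Ruby sketch of an EC2 metadata crawl (IMDSv1, no session token).
# The point: a 404 on one key only skips that key instead of nuking the
# whole result.
def crawl_metadata(http, path = '/latest/meta-data/')
  listing = http.get(path)
  return {} unless listing.code == '200'

  listing.body.split("\n").each_with_object({}) do |key, result|
    if key.end_with?('/')
      # A trailing slash means "directory": recurse into it.
      result[key.chomp('/')] = crawl_metadata(http, "#{path}#{key}")
    else
      value = http.get("#{path}#{key}")
      result[key] = value.body if value.code == '200' # tolerate the 404 and move on
    end
  end
end

http = Net::HTTP.new('169.254.169.254')
puts JSON.pretty_generate(crawl_metadata(http))
```

Point being: one missing key shouldn't take `node['ec2']['public_hostname']` down with it.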
In addition, the Route53 Chef provider we were using only checked whether the public hostname was different from what was already in Route53. Empty is technically different from a hostname, so the entry kept getting deleted because Chef was helpfully trying to update the information. Eventually, we updated our Route53 cookbook to guard against an empty public hostname.
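Our actual cookbook change isn't shown here, but a minimal sketch of the idea, assuming a `route53_record` resource and the usual ohai-populated `node['ec2']` attributes (the resource name, `zone_id` attribute, and record layout are all illustrative), looks something like this:

```ruby
# Sketch only: resource and attribute names are illustrative, not our cookbook's.
# The point is the guard: if ohai handed us an empty public hostname, leave
# the existing Route53 record alone instead of "updating" it to nothing.
public_hostname = (node['ec2'] && node['ec2']['public_hostname']).to_s

route53_record "point #{node['fqdn']} at this instance" do
  name    node['fqdn']
  value   public_hostname
  type    'CNAME'
  zone_id node['route53']['zone_id']   # assumed attribute
  action  :create
  not_if  { public_hostname.empty? }   # the fix: empty metadata is not a real change
end
```

The guard is the whole point: an empty hostname means "the metadata lookup failed," not "please delete the record."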
But because of various Reasons, Chef remained disabled on these 8 boxes. We worked with AWS Support on this, and eventually they told us that they had fixed the metadata on their end and all we had to do was reboot the servers. At which point you might tell me that you can step down a primary node in a replica set, one of the secondaries will seamlessly take over, and everything will be glorious because it's so full of delicious webscale sauce. But no. Because of how PyMongo handles reconnections, and how many PyMongos we have, stepping down the primary or removing a secondary member triggers an election, and then we see a rash of 500 errors while the various PyMongos figure out what happened and reconnect. To avoid that shittiness for our customers, this has to happen in a maintenance window, so we decided to wait until the next time we had to upgrade the cluster. Luckily, MongoDB 2.4.9 provided that opportunity.
The First Law of Yaks is that there are always more yaks than you expect. Such yaks for us included:
- Software RAID, which had been configured on these boxes in ancient times. Our /data partition lived on /dev/md127, but our Chef recipe expected /dev/md0, so the first box we rebooted didn't mount it at all. Since /data was where all the Mongo data lived, Mongo helpfully decided to start syncing in from one of the secondaries (a backup, priority-0 member that doesn't have SSDs), and in our early-morning, caffeine-deprived state, before we realized the RAID array hadn't mounted, we kind of figured that yeah, resyncing everything from scratch off the shittiest secondary was probably just something Mongo would do, because webscale/yolodb/whatever. Luckily we found the data (mostly right where we left it!) and saved the cluster. There's a sketch of the device-naming fix after this list.
- A baby yak who helpfully pointed out that we should set up aggregates for several of our Sensu checks.
- A disk readahead yak.

And that's why I've been cursing everything.
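For the curious, the RAID yak boiled down to a hardcoded device name. A sketch of the kind of fix that helps (not our actual recipe; the filesystem type, mount options, and resource layout are assumptions) is to ask the kernel which md device actually got assembled instead of assuming /dev/md0:

```ruby
# Sketch, not our recipe: stop hardcoding /dev/md0 and ask the kernel which
# md device was actually assembled (mdadm happily picks /dev/md127 when the
# array metadata doesn't match what the host expects), then mount /data from
# whatever that turns out to be.
md_device = nil

ruby_block 'find the assembled md device' do
  block do
    # Active arrays show up in /proc/mdstat as lines like "md127 : active raid10 ..."
    name = ::File.read('/proc/mdstat')[/^(md\d+)\s*:\s*active/, 1]
    raise 'no active md array found; refusing to guess' if name.nil?
    md_device = "/dev/#{name}"
  end
end

mount '/data' do
  device  lazy { md_device }
  fstype  'ext4'     # assumption; use whatever the array is actually formatted as
  options 'noatime'
  action  [:mount, :enable]
end
```

Mounting by filesystem label or UUID (the mount resource's `device_type :label` / `:uuid`) sidesteps the renaming problem entirely and is arguably the simpler fix.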