On simplifying and monitoring

This is a story of complexity and monitoring, otherwise known as the Tale of the Blip.

One fateful Thursday, approximately 2 weeks ago, at exactly 4:27 pm, PagerDuty informed us that our site was down. We checked, and it was, so people immediately jumped on it to figure out why. We noticed that all the web and API nodes had dropped out of their respective load balancers in AWS, but pretty much as soon as we got that far, the nodes all came back into the ELBs, the site came back up, and everything was back to normal. We also noticed several emails indicating 500 errors (our surprisingly effective way of keeping the devs aware of when their changes break things) showing connection timeouts between the web and the API. The timing of those was suspicious, but API connection timeouts are a thing that happens more often than never, and as far as we knew they'd never caused the entire site to go offline.

Fast-forward a few hours to a company happy hour. We're all having a good time, drinking beers and playing shuffleboard, when BLIP! The site is down again. This time all the web nodes had dropped out of their ELB, but the API nodes were still in theirs, and once more everything returned to normal within a minute or two. Why would a few API connection timeouts take the entire site down? And why were web and/or API boxes dropping out of their ELBs all at the same time?

Some further investigation (and head-desking, and drinking) clued us into the fact that during these blips we were able to access the web and API machines directly; it was going through the ELBs that was the problem. After determining that it wasn't an issue with the ELBs themselves or some peanut butter in the tubes somewhere in AWS, we figured there had to be something going on with the health check we'd defined to let the ELB know whether nodes were healthy. The ELBs were configured to hit a /health URL on each box, behind which sat a homegrown service called healthd. Because each API server ran 2 API services (cleverly named API1 and API2, with nginx routing requests to the proper service), healthd asked each service if it was healthy and only returned a 200 OK to the ELB if both services returned OK individually. Some third-party API calls in just API2 had been timing out, which made healthd time out, so the ELB assumed the whole box was dead.
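
To make the failure mode concrete, here's a minimal sketch of the kind of aggregation healthd was doing. This isn't our actual code, and the ports, paths, and timeouts are hypothetical. The important part is the all-or-nothing aggregate: one slow or failing backend is enough to make the whole box report unhealthy, and if a backend check hangs, the ELB's own health check times out right along with it.

```python
# Minimal sketch of an aggregating health endpoint like healthd.
# Ports, paths, and timeouts are illustrative, not the real values.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

BACKENDS = {
    "api1": "http://127.0.0.1:8001/health",  # hypothetical Apache-backed service
    "api2": "http://127.0.0.1:8002/health",  # hypothetical Tornado-backed service
}

def backend_ok(url, timeout=2):
    """True only if the backend answers 200 within the timeout."""
    try:
        return urlopen(url, timeout=timeout).status == 200
    except OSError:  # connection refused, timeout, HTTP error, ...
        return False

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The ELB only ever sees this aggregate result: if either backend is
        # slow or sick, the entire box is reported as unhealthy.
        healthy = all(backend_ok(url) for url in BACKENDS.values())
        self.send_response(200 if healthy else 503)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```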

Why did this happen on every box at once?

What changed at 4:27 pm on that fateful Thursday to make this start happening?

And how did our monitoring not let us know that something was going this wrong?

As it turns out, the first time this had actually happened was in April 2013. Searching the archives of our 500-error emails showed that it also happened 4 times between August and October 2013, again each time for no more than a couple minutes. But so much time had passed in between these incidents that nobody was able to put the pieces together until things started going really wrong. It wasn't that we made a deliberate change at 4:27 pm. It was simply that we had higher traffic that day than we had ever had in the history of the company, which was enough strain on our system to finally bring this subtle problem to light, and now instead of disappearing back into the ether from whence they came, the blips continued for the better part of a week and a half.

What went wrong?

Unintentional complexity.

The decision to have nginx route traffic between 2 API services (one on Apache, one on Tornado) was not the result of deliberate design. It was the consequence of a project to replace API1 with API2 that, while well-intentioned, turned out to be impractical and was tabled halfway through, leaving the system in a state it was never really designed to be in. To make things worse, nobody had considered, before the transition started, how the whole thing was going to be monitored.

The more complex a system is, the more complex it becomes to monitor that system accurately. Each interaction between different parts of the system is a potential point of failure. The ELB only had insight into one thing: the result it got from healthd. But healthd was a black box that, at the time, didn't log its failures, so when one of the components it was checking failed, it gave no indication of which one it was, which made tracking down the actual problem that much harder. One of the first things we did was add logging to healthd and ship those logs to Loggly. Lessons learned: Log your failures. When something breaks, you want to know exactly what broke, in as much detail as is reasonable. You don't have to record every possible detail, but you do need enough to know which part of a multi-part system is having issues. And don't deliberately build black boxes into your system if you can help it.
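
For illustration, the change amounted to something like the following sketch (with made-up names rather than the real healthd code, building on the backend_ok() helper from the sketch above): each individual check gets its own log line, so a 503 to the ELB always comes with a record of which component caused it, and those lines can then be shipped off-box, to Loggly in our case.

```python
# Sketch of per-component failure logging for the aggregated health check.
# Uses backend_ok() from the sketch above; names and fields are illustrative.
import logging
import time

logger = logging.getLogger("healthd")

def check_all(backends, timeout=2):
    """Run every backend check; log which component failed and how long it took."""
    healthy = True
    for name, url in backends.items():
        start = time.monotonic()
        ok = backend_ok(url, timeout=timeout)
        elapsed = time.monotonic() - start
        if not ok:
            healthy = False
            # One line per failing component: enough detail to know which
            # half of the box is actually sick, without logging everything.
            logger.error("health check failed: component=%s url=%s elapsed=%.2fs",
                         name, url, elapsed)
    return healthy
```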

Speaking of black boxes, let's talk about AWS! The great thing about the cloud is that it's largely not your problem. The bad thing about the cloud is that sometimes you want it to be your problem so you can track it down and fix it. We finally resolved these issues 2 days before Amazon announced their new feature of access logs for ELBs, and before we figured out the actual root cause, the symptoms seemed to point to an issue in one of the two ELBs themselves. Those logs wouldn't have identified our problem, but they would have ruled out a few things it was not, which would have sped up the debugging process. Not every business is at the point where they can have their own data center and run everything in-house so they have insight into these kinds of things, and that's fine. But when you find yourself using some third-party solution or service where you don't have total visibility into what it's doing, you should at least consider how you might monitor it.
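
That feature exists now, and it's worth turning on before you need it. As a hedged illustration (not something we could have run at the time, and with made-up names for the load balancer and S3 bucket), enabling access logs on a classic ELB with boto3 looks roughly like this:

```python
# Rough example: enable access logs on a classic ELB with boto3.
# Load balancer and bucket names are hypothetical.
import boto3

elb = boto3.client("elb", region_name="us-east-1")
elb.modify_load_balancer_attributes(
    LoadBalancerName="web-elb",                    # hypothetical ELB name
    LoadBalancerAttributes={
        "AccessLog": {
            "Enabled": True,
            "S3BucketName": "my-elb-access-logs",  # bucket must allow ELB log delivery
            "S3BucketPrefix": "web-elb",
            "EmitInterval": 5,                     # minutes; 5 or 60 are allowed
        }
    },
)
```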

Something else our monitoring had previously been lacking was detailed information as to which of the API2 calls was responsible for the timeouts we were seeing. As had been the case with healthd, we knew *something* was failing, but didn't know which particular thing it was. So once again: when you have a system with many moving parts, your monitoring should go into as much detail about the individual parts as is reasonable without overwhelming yourself. Once we configured our API to send response times for individual handlers to Datadog, we were able to figure out within minutes which API calls were causing problems and deal with them individually.
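
To make that concrete, here's a hedged sketch of what per-handler timing can look like in a Tornado app (API2 was the Tornado side) using the Datadog statsd client. The metric name, tags, and handler class are made up for illustration; this isn't our production code.

```python
# Sketch: report per-handler response times to Datadog via the local
# dogstatsd agent. Metric names, tags, and handlers are illustrative.
import tornado.ioloop
import tornado.web
from datadog import initialize, statsd  # pip install datadog

initialize(statsd_host="127.0.0.1", statsd_port=8125)

class InstrumentedHandler(tornado.web.RequestHandler):
    def on_finish(self):
        # request_time() is seconds since the request started; report ms,
        # tagged by handler class and status so one slow endpoint stands out.
        statsd.timing(
            "api.response_time",
            self.request.request_time() * 1000,
            tags=[
                "handler:%s" % self.__class__.__name__,
                "status:%d" % self.get_status(),
            ],
        )

class ThirdPartyHandler(InstrumentedHandler):
    def get(self):
        self.write("ok")  # a real handler would call out to the third party here

if __name__ == "__main__":
    app = tornado.web.Application([(r"/third-party", ThirdPartyHandler)])
    app.listen(8002)
    tornado.ioloop.IOLoop.current().start()
```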

Our mid- and post-mortems for the Blips of course addressed our monitoring as well as the underlying architectural issues that caused this bug. While the immediately actionable items have already been taken care of (such as the response times for individual API handlers and the healthd error logging), we now have a checklist of things to keep in mind in the future when developing and deploying complex systems like these. Complex systems are not necessarily bad, but they should be used only when they add value, and when they are used, each component in them should be monitored individually. You can't necessarily tell from a failure of the whole which part is at fault, especially without logging!

Your monitoring systems will need to grow and develop alongside the systems they are monitoring. There needs to be enough communication between whoever is developing new systems and features (probably devs) and whoever is primarily responsible for the monitoring (probably ops) to make sure nothing goes into production without being monitored. Ideally, the question of how to monitor a system, especially a complex one, will have been discussed from the beginning of the project and will evolve along with it as requirements and features change. If something ever goes into production and your ops team didn't even know about it, you've got too many silos and not enough communication. With our cross-functional, project-centric teams, it's a lot easier to avoid those all-too-familiar walls and silos and to make sure that both dev-types and ops-types are aware of what's being built and how it's being monitored throughout the product life-cycle.

And that's the story of why the engineering team flinches whenever they hear the word 'Blip'.