Joseph Choe

Outage

About two years ago, the company I worked for had an incident that shut down service to our main website and application for nearly 24 hours. Even when service was restored, the application suffered degrading performance problems for several weeks after that.

As I’ve mentioned before, problems of this scale and magnitude happen due to a series of successive failures, though most of those are invisible and left to fester until they compound upon one another into something catastrophic.

Here’s what happened.

Incident Report

On the morning of the incident, a developer deployed a code change into production. Normally, this is something that happens several times a day. However, this developer had committed a code change that had a critical bug that would throw exceptions for all asynchronous background jobs.

At the same time, the same developer updated the control plane version to our managed Kubernetes services. I’m sure there are those among you who can see where this is going, but let me spell it out for you.

These two actions caused a series of cascading failures.

Because of the critical bug, all asynchronous background jobs started failing. Kubernetes pods began to go offline, because these failures. Eventually, the entire node group was drained and knocked offline. No new pods could even get started on that node group.

Normally, EKS would provision new nodes in that case. But because the developer had updated the control plane version that morning, the control plane and the node group were different versions from each other, meaning they could not communicate with one another. The node group remained in an unusable state for several hours.

AT the time, we were using an architecture that consisted of a server-side rendered React application and a Python Django API backend. Because the Django backend began throwing errors and the node group was drained, even the frontend application was affected as they were on the same node group. We lost those pods as well.

At the very same time, the iOS developer, a separate person from the developer from the above, had submitted our iOS application for review at the App Store. Because our entire application was down, the review team rejected our application. If you understand anything about the iOS App Store, each submission was a whole process that could take several days to weeks. We had just shot ourselves in the foot and wasted several days.

Key Takeaways

On the surface, I think it’s easy to say that the developer should have tested his code. And he should have. Everyone should test and execute every line of code that they change and commit to the repository. Too many developers rely on QA teams to catch these sorts bugs, but the first line of defense against bugs is to not code them in the first place.

However, the problem during that particular incident runs even deeper than that. Like I’ve said before, failures like the ones above don’t happen in isolation and aren’t the fault of any one person, but the result of years of systemic failures.

We had a small team of five developers at the time. When you have a team that small, it’s entirely possible for every developer to know what every other developer is doing. Yet that was precisely not the case. The lack of communication was obvious to see, since no one except the iOS developer knew about the impending App Store review.

Because of this lack of communication, no one knew that the Kubernetes control plane version had been updated, except for the developer in question. And since that developer didn’t know how to solve the problem, nor could begin to articulate all the things he had done or what the source of the problem was, I had to spend several hours troubleshooting things on my own.

Had I known about all of these things, I could have solved the problem much more quickly than the 24 hours that it took. I could have immediately provisioned new node groups to the newer control plane version. I could have provisioned separate node groups for the React application and the Django application so that errors in one wouldn’t interfere with the other. And so on and so forth.

There’s a concept I’ve talked about before rather tangentially called capacity or work in process. This doesn’t also apply to workers but also to teams. Having too many axes of change can be detrimental to teams, especially if they don’t talk to one another. They begin to bump into one another and step over each other’s work.

But why don’t they talk to one another? Not because they dislike one another, but because the work they’re doing has nothing to do with the other. When you try to utilize a worker or team to 100% capacity, these sorts of silos begin to build up.

Everyone has deadlines to keep after all, so they all hunker down and bang away at the keyboard until no one cares about what anyone else is doing. Bad decisions get made or these decisions are not communicated with the rest of the team because they’re made privately when coding or when discussed between some part of the team but not everyone else, but then it turns out that the with the most knowledge about what might be wrong is nowhere nearby when an outage occurs.

This is a problem, as most problems are, with the leadership team.