The Outage
Incidents and outages can be your team's greatest source of learning, but only if it's safe to talk about them.
Does your org talk about outages? Are retros shared openly? Is it truly a blame-free environment? The answers could be a commentary on your team’s culture. Today I will tell you about one of the worst outages of my career, which happened under the strangest of circumstances. You join us mid-triage, and the conversation is going something like this…
Engineer #1: “We’re fully down, you’re going to have to tell them.”
Francis: “I would, but the Secret Service have the whole floor locked down. I don’t think I can even get down there.”
Engineer #2: “Dang, you’re right, you can’t get down there without the background check.”
Francis: “Ok, let’s just make sure. Pull up that graph one more time… and you said we added hosts?”
Engineer #1: “Yep. They just keep falling over.”
Engineer #2: “I don’t understand, there’s not even that much load right now.”
Engineer #1: “Let’s just bounce the cluster…”
Francis: “Ok, time’s up, I’m going to have to let them know.”
How did we get here?
In 2017 Joe Biden (yes, that Joe Biden) was visiting our offices to announce the launch of his new book “Promise Me, Dad” - the heartbreaking story of Beau Biden and his battle with a malignant brain tumor. Beau was the son of Joe and Joe’s late wife Neilia Hunter. Biden was scheduled to be interviewed in our offices that day, and the employees who would be near him had been pre-screened by the Secret Service weeks in advance. It was a big moment.
“Promise me, Dad,” Beau had told his father. “Give me your word that no matter what happens, you’re going to be all right.” Joe Biden gave him his word.
The morning had started much like any other and we were prepped and ready to go. Joe’s book had racked up a respectable number of pre-orders and everyone who had already purchased the book would have it delivered automatically to their libraries at the release time early that morning.
The appointed hour arrived, the system clock ticked over, and… our system fell over. Alarms started going off. Pagers began paging. We were experiencing a rolling outage. Joe Biden was on the floor below us, being interviewed by our CEO. Since Joe was onsite, the entire floor of the building had been locked down by the Secret Service, but somehow we had to work the issue and restore service.
The internet wasn’t too happy with us either…
How Can Outages Help You?
Outages, and incidents in general, can be a source of great learning, both technically and in terms of leadership, so it’s worth reflecting on them.
When I interview people, especially senior folks, I usually try to have them explain an outage they experienced. It’s one of my favorite interview questions. Weak folks (ironically) will play off the question to feign strength, with answers like “I can’t recall” or “I’ve never really had that bad an incident” - a huge red flag.
Strong candidates are readily able and willing to take you back to that moment and recite chapter and verse. Being able to dive deep and explain an outage is a good sign they’ve thought about what went wrong and hopefully conducted a retrospective. It’s also a sign that they come from a culture where it’s ok to talk about incidents. This last part is foundational.
Tips to Help Manage Through an Outage
Experience is what you get when you don't get what you want. Here are some things to consider the next time you find yourself in the middle of an incident.
Stay Calm & Don’t Panic - remain composed. If you freak out, everyone will freak out, and in a crisis that’s not helpful.
Work the Issue - When stuff’s broken you have to focus on getting things back to normal. It’s not the time to assign blame or try to “fix” the process breakdown that got you there. Just work the issue.
Follow the Data - Walk the mental model of how the system should work, consult your graphs and instrumentation to verify it, and trace the issue back to the root cause. The more things you can rule out through concrete data, the more cycles you can devote to productive triage.
Don’t Ignore Your “Hunch” - engineers will often have an intuition for how things work or don’t work. Doctors have a similar intuition - “this smells like a heart attack” - even though they don’t have definitive data. These hunches can be helpful, so don’t ignore them. Use high judgment and chase them down. If the data correlates, you may be on the right path.
Don’t Start the Retrospective During the Triage - it’s easy to start talking about “if only we’d done XYZ” in the moment. That’s not helpful. When the system is down your primary focus should be on restoring service.
Be Positive and Avoid Blame - as a leader, it’s your job to proactively counteract the negativity of the moment. Blaming people will shut the conversation down. That’s not what you want in the middle of an issue.
Ensure People Feel Safe - it’s a crisis, so recognize that people are feeling their most vulnerable. This is an opportunity for you to lead. Don’t miss it.
Assign an Owner for the Retro - do this after the incident has been mitigated but BEFORE the call is closed, and be sure to set a reminder somehow (most of the tools out there will do this for you).
This Biden incident was triaged by just a handful of folks, four or five of us total. As a team we kept focused on the outage and the resolution. The details are interesting but wouldn’t be appropriate to share here. If you see me in person I’ll be happy to dive deep. I think the whole thing was mitigated by about 2 PM that day, and I never did have to give Joe Biden the bad news.
Through this incident it became obvious there were some design flaws in our API, so we conducted a retro and prioritized the refactoring to avoid this in the future.
Celebrate your Outages and Learn from them.
In my time at Amazon I experienced an open culture where the incidents that stand out in the history of the business are shared transparently. These “Great Disasters” are a part of onboarding. Some examples include the “Hulk Hands Incident” that overloaded the order delivery system or the time a single pipe symbol brought down all of S3 in US-East. Or the cat who fell asleep on the keyboard and…well that’s another story.
In Conclusion
No career is free of blemishes and I deeply value the incidents I’ve been involved in and accountable for. That sounds weird to say but it’s true.
Incidents and outages come at the worst times: during high-profile or high-pressure launches, or when you’re out enjoying your kid’s birthday party. In the moment, they’ve all felt horrible, but I’d like to think I’ve learned something from each one and that they’ve made me a better engineer overall.
I remember this outage like it was yesterday. That was truly one for the ages…