Superpower #16: Resiliency
What Ultramarathons and 100 miles in the Rocky Mountains taught me about Systems Design
I fell into the chair, broken, demoralized, defeated, and held out my arm. The steward asked, “Are you sure?” I nodded and said “Yes”. In a single motion, he cut the band from my wrist. My race was over. I had voluntarily dropped from the Leadville 100. That was about 5:45 PM. Less than an hour later I was kicking myself and wondering what the hell happened.
What is the Leadville 100?
“Leadville,” as it’s known in the ultra-running community, is a 100-mile trail race in Leadville, Colorado. The course traverses the Rocky Mountains, ranging from 9,500’ to well over 12,000’ in altitude. It’s one of the oldest 100-mile races in the United States and a bucket-list race for anyone who’s spent time out on the trail.
I gained entry in 2016 and spent 8 months scaling my training to meet the challenge. I ran high-intensity intervals and long runs on the weekend. I ran back-to-back long runs on tired legs. I practiced with my gear. I tested different nutrition. Everything was good. So why did I find myself voluntarily quitting halfway through what should’ve been a dream event?
In the months that followed, I did a lot of soul-searching and realized I had scaled my training, but I’d missed a fundamental element of all ultras: Resiliency.
My training had prepared me for the race. As long as everything went smoothly, I’d have a good day. But ultramarathons will teach you that nothing ever goes smoothly. So it is with engineering infrastructure and software systems. We design for the happy path and try to design for scale, but it’s just as important to consider failure cases. It’s critical to build Resiliency into the solution in addition to Scale.
What is “Resiliency”?
We often think of things as either “working” or “broken”, but a “resilient” system will keep functioning despite pieces of it being unavailable. My mental model for this is the T-800 (Arnold’s character) in the original Terminator movie. It never stopped going after Sarah Connor, even when its legs had been blown off.
When we prepare for failure we can be surprised by success (the reverse leads to tears).
The military has a term: “Embrace The Suck”. It’s a thumb in the eye of Adversity. Don’t be surprised when bad turns to worse. Expect any given situation to degrade and keep moving forward despite it. No matter what.
For running, we can develop this muscle by training in less-than-ideal conditions. Get out and run when it’s raining and cold. This is sometimes called “mental callusing”: toughening your mental game along with your physical fitness. We can apply the same principles to Systems Engineering.
In Software Engineering, you plan for failure and test what your system does under various conditions. The goal is for the system to degrade gracefully when parts of its functionality are impaired.
A Path towards Resiliency:
Understand the landscape - Start with your core use cases. What is absolutely critical to the customer experience, and what can you live without?
Dependencies - For each core use case, identify all critical dependencies and make sure they’re well understood by the team.
Flip switches (in a controlled manner) - Inject failures and simulate dependency outages using a proxy (e.g. Charles or Flipper). How does your client behave when a service it depends on degrades, becomes unavailable, or returns a malformed response? Does it crash, or does it handle the failure with a well-designed user flow? (See the fault-injection sketch after this list.)
On the client side - Add exception handling where appropriate: client software should never crash because a backend service times out or is unavailable. Catch the exception and inform the user when appropriate. This builds resiliency into your client, but now you may have a new set of problems (see the next item).
Retries and Exponential Backoff - When a partial backend outage occurs but the clients calling it stay up, you may have a surprising problem: a naive client will simply retry the service over and over until it’s restored, which can lead to a self-inflicted denial-of-service. The fix is to use exponential back-off in your retry logic so that the longer the outage endures, the less traffic the unavailable service receives. This gives the service a chance to recover (see the back-off sketch after this list).
Practice Recovery - When the service finally does come back, you may have a “thundering herd” problem as all clients suddenly hit the newly restored service at once. This is a difficult problem to mitigate, but you can find the right strategy through practice under real-world conditions. Sometimes you can dial the restored service back up through traffic shaping at your network ingress, allowing caches to prime and the service to stabilize before it handles the full user traffic.
Redundancy - Maybe your infrastructure has a single point of failure? If that resource becomes unavailable, do you have a secondary source? A simple example is failing over to a static version of the data should the real-time source become unavailable. That static version needs to be super-reliable; a simple solution is a static JSON file hosted on AWS S3 that mimics the response from the service that’s down. Granted, this data will likely be stale, but the customer experience is only degraded, not broken, and you’ve avoided a full outage. Be sure to test all these failover scenarios under real-world conditions (see the fallback sketch after this list).
Early Detection - Most alarms go off when something’s wrong. That’s usually too late. Consider adding preemptive alarms that anticipate failures when key metrics start trending away from their SLAs. Early detection is always easier to deal with than a full-blown outage. (See the trend-alarm sketch below.)
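To make the ideas above a little more concrete, here are a few rough sketches in Python. First, the fault-injection idea. Rather than a real proxy like Charles or Flipper, this wrapper simulates outages and malformed responses in code; the flaky wrapper, fetch_profile, and the failure rates are all hypothetical names and numbers, only meant to show the shape of the exercise.

```python
import random

class SimulatedOutage(Exception):
    """Stands in for a timeout or 5xx from a dependency."""

def flaky(call, failure_rate=0.3, garble_rate=0.1):
    """Wrap a dependency call and randomly inject failures.

    `call` is any zero-argument function returning a dict. This is a
    test-only stand-in for a fault-injecting proxy.
    """
    def wrapped():
        roll = random.random()
        if roll < failure_rate:
            raise SimulatedOutage("dependency unavailable")
        if roll < failure_rate + garble_rate:
            return {"unexpected": "shape"}  # malformed response
        return call()
    return wrapped

# Exercise a hypothetical client call against the flaky dependency.
def fetch_profile():
    return {"user": "runner42", "miles": 100}

get_profile = flaky(fetch_profile)
for _ in range(10):
    try:
        profile = get_profile()
        print("got:", profile.get("user", "<malformed response>"))
    except SimulatedOutage:
        print("dependency down -- showing an offline view instead")
```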
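Next, a minimal sketch of retry with exponential back-off. The call_with_backoff helper and ServiceUnavailable error are made-up names; the point is that the delay doubles after each failed attempt, a bit of random jitter keeps a crowd of clients from retrying in lockstep (which also softens the thundering-herd problem), and the final failure is surfaced so the caller can degrade gracefully instead of crashing.

```python
import random
import time

class ServiceUnavailable(Exception):
    """Stands in for a timeout or 5xx from the backend."""

def call_with_backoff(call, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Call `call()`, waiting exponentially longer before each retry.

    Jitter (a random fraction of the delay) spreads retries across clients.
    The last error is re-raised so the caller can degrade gracefully.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except ServiceUnavailable:
            if attempt == max_attempts - 1:
                raise
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay * random.uniform(0.5, 1.5))  # jitter

# Usage: the client catches the final failure instead of crashing.
def call_service():
    raise ServiceUnavailable("upstream is having a bad day")

try:
    data = call_with_backoff(call_service, max_attempts=3, base_delay=0.1)
except ServiceUnavailable:
    data = None  # fall back to a cached or static view (next sketch)
```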
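And a sketch of the static-fallback idea: prefer the real-time service, but if it’s unreachable, serve a stale-but-reliable JSON snapshot hosted on S3. The URLs and response shape here are placeholders for illustration only.

```python
import json
import urllib.error
import urllib.request

# Hypothetical endpoints, for illustration only.
LIVE_URL = "https://api.example.com/v1/leaderboard"
STATIC_FALLBACK_URL = "https://my-bucket.s3.amazonaws.com/leaderboard-snapshot.json"

def fetch_json(url, timeout=2.0):
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))

def get_leaderboard():
    """Prefer the real-time service; fall back to a static S3 snapshot.

    The snapshot is stale, so the experience is degraded rather than broken.
    If both sources fail, the error propagates to the caller's handler.
    """
    try:
        return fetch_json(LIVE_URL)
    except (urllib.error.URLError, TimeoutError, json.JSONDecodeError):
        return fetch_json(STATIC_FALLBACK_URL)
```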
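Finally, a toy version of a preemptive, trend-based alarm: instead of firing only after an SLA is breached, it extrapolates the recent slope of a metric and warns while latency is merely trending toward the limit. The window size, look-ahead, and thresholds are illustrative, not tuned.

```python
from collections import deque

class TrendAlarm:
    """Alarm when a metric is trending toward its SLA, before it's breached."""

    def __init__(self, sla_ms, window=20, lookahead=10):
        self.sla_ms = sla_ms
        self.window = deque(maxlen=window)  # sliding window of samples
        self.lookahead = lookahead          # how many samples ahead to project

    def record(self, latency_ms):
        self.window.append(latency_ms)
        if len(self.window) < self.window.maxlen:
            return False  # not enough data yet
        # Simple slope: average change per sample across the window.
        slope = (self.window[-1] - self.window[0]) / (len(self.window) - 1)
        projected = self.window[-1] + slope * self.lookahead
        return projected > self.sla_ms

# Usage: page someone while latency is merely trending bad, not yet broken.
alarm = TrendAlarm(sla_ms=500)
for sample in [120, 130, 155, 180, 210, 240, 275, 310, 350, 380,
               410, 430, 440, 450, 455, 460, 462, 465, 468, 470]:
    if alarm.record(sample):
        print(f"warning: latency {sample}ms is trending toward the 500ms SLA")
```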
I hope this list sparks some ideas that are relevant to your specific situation and maybe gives you a point from which to start.
Building software and shipping product is an iterative process. What starts out as a simple experience is enhanced and extended over time. For a mature system, this can mean a lot of features have been layered on. If we’re not careful, this can lead to a fragile solution where a series of small failures in sub-systems can bring down the entire infrastructure.
That day in 2016 I had experienced a series of issues, any one of which was tolerable but each one of them caught me off guard and chipped away at my mental fortitude. I didn’t have the resiliency I needed to keep everything rolling. By mile 40 I was mentally defeated. Over the last 10 miles I made peace with the notion I was going to drop. Physically I could’ve continued but mentally I was done. System shutdown!!!! So I quit.
Building resiliency in systems, in ultras, and in life requires perseverance and the right mindset. Assume the “happy path” is the exception rather than the rule. Your goal is to minimize the impact of any single failure and keep the system running smoothly. When we prepare for failure we can be surprised by success. The reverse leads to tears.
In 2018, I returned to Leadville and finished the race. I encountered just as many problems as on my first attempt, but the difference was entirely mental. Blisters, nausea, altitude, rain, sunburn: I was not fazed by any of it. At one point I almost fell off the mountain when a tree I leaned on fell over. It didn’t bother me. Things went off the rails but I adjusted and kept moving. Despite all the problems, or maybe even because of all the problems, that race was one of the best of my life.
The key lesson? Keep running even when the wheels fall off...