This post is first in a short series I’m calling the “SRE Toolkit”, each entry being a friendly introduction to one concept I’ve consistently found useful in my quest to make software sturdier. Up first – how to be good at failing!
So, you’re at the train station.
This is a big station, think Grand Central Terminal in Manhattan (which services 67 tracks and hundreds of trains per hour during peak times), so we’ve got a lot of trains coming through. And these trains travel along several different routes; let’s say there are distinct 4 major lines (North, South, East, and West) that all terminate at our big station in the middle of town. Let’s also add that, like the real Grand Central, there is no consistency about which train lines arrive on which tracks. If your home is along the North line, be prepared to run to any individual track once you’ve left work, they’re all fair game.
Everything’s humming along well enough, until one unlucky Friday afternoon, a snowstorm hits / cow falls over / alien colony invades and takes out a big section of the West line.
This… creates a bit of an issue at our station. Inbound West trains are arriving, but then the controller tells them they can’t leave, since the rails need to be kept clear for workers. So they start to build up, taking space in any available track at the station, until eventually, all the tracks are full of West trains, and poof – service for the North, East, and South lines is totally out, since there’s no place for them to go. Even though there was no disaster on those lines, 100% of folks are now stuck, even though 75% of them could be getting home if we had just kept a few tracks open.
What we’re experiencing is the problem with not isolating your failure domains.
In the world of software development, a “failure domain” is an abstract concept which refers to a subset of your application that might fail.
This can certainly mean an entire subsystem of your service – all of your API servers or databases or the like. I like to call these horizontal failure domains. However, more interesting (in the context of this blog post, at least) is the vertical failure domain, which will span multiple subsystems, but only occupies a slice of each one. Think at a grocery store – the cash registers, little conveyor belts, and employees are each horizontal domains, but the checkout lanes (each consisting of one register, one belt, and one employee) make up a vertical domain.
The most common examples in tech of vertical failure domains are at the cloud infrastructure level (for instance, Availability Zones in AWS), but they’re usually just as applicable at the application level (for instance, different customers signed up for your SaaS).
When we say it’s good to “isolate” the failure domains from each other, it just means that we should strive to make sure that one independent part blowing up doesn’t take down all the other parts with it. So to reference our previous examples, the grocery store is doing it well, because when one register fails or an employee doesn’t show up for work, all the other lanes are unaffected! But your SaaS might not be, since one customer suddenly sending a whole bunch of malformed requests might take down your platform for everyone else (unless you’ve done your architectural diligence).
Or to settle our imaginary train station’s problems, we’d just need to set up dedicated tracks for each line in our terminal. Then, a catastrophic issue on the West line couldn’t cause those trains to overwhelm the terminal, because the hindered trains would fill up about 1/4 of the tracks (the West allotment), and then no more, allowing the other lines to continue operating normally – and most people to make it home.
Accomplishing this can be tricky, since it almost always involves taking nice, easy, shared infrastructure, and breaking it up into a bunch of different pieces to manage (I’ve heard them called “pools”, “shards”, “tracks”, or “cells”). But when every outage starts being a partial outage instead of a full-blown dumpster fire, you’ll be glad you did!