I’ve been doing this “reliability” stuff for a little while now (~5 years), at companies ranging from about 20 developers to over 2,000. I’ve always cared primarily about the software elements I describe as living “outside” the application – like, how does it get its configuration? What kinds of instances does it run on, and are those the best kinds to use? What steps does it take on its path from “code in a repository” to “running in production”? And I’ve always kept track of what I liked – which mechanisms allowed fast iteration and which caused frustration, which led to outages and which prevented them.

At this point, I think there are enough lessons here that it might be useful if I wrote them all down, even if it’s just something for me to reference or add to later.

Note that this list is a bit weird in some ways, coming from an SRE. My goal here isn’t “what is 100% the most reliability-oriented way we can build things”, it’s more like “what is the 80% of reliability we can get for 20% of the effort while still enabling devs to go fast”, which ultimately gets you a system that looks pretty different. But it’s a line worth walking – if you do it well, working with production is fun, instead of miserably-safe or frighteningly-dangerous.

Also, please do me a favor and mentally prepend each of the following bullet points with the word “usually”. Every situation is unique, and just because I haven’t seen a case where (for example) using Git is a bad idea, doesn’t mean that case doesn’t exist. Only a Sith deals in absolutes, etc.

So! With that out of the way – this is how I’d rebuild it all from scratch if I could.

Coding (parameter? I hardly know ’er)

  • No in-code fallbacks for configs. If your service can’t load the config on startup for any reason, it should just crash – that’s much easier to diagnose than the result of one borked instance going down an ancient code path because no one remembered to clean up that line config.get(enable_cool_new_thing, false) after finishing their rollout.

  • Extremely strict RPC settings. I’m talking zero retries (or MAYBE one), and a timeout like 3x the p99. We are striving for predictability here, and sprinkling retries or long timeouts as a quick fix for a flaky downstream service will turn into a week-long investigation and a migraine a year from now. Fix the flaky service!

  • Never give up on local testing. It keeps dev cycle time much shorter than needing to rely on (and fiddle with) CI or remote workspaces. Containerizing the local test environment can make it easier to keep dependencies straight and consistent across machines.

  • Avoid state like the plague. Managing a stateful service is an order of magnitude trickier than a stateless one. There are plenty of good managed DBs and caches out there, just use one of those!
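
To make the first two bullets a bit more concrete, here is a minimal Python sketch, assuming a JSON config and a list of observed latencies in milliseconds. The key names, ConfigError, RpcPolicy, and strict_rpc_policy are all made up for illustration and don't come from any real service or library. The p99 is a simple nearest-rank estimate, which is one of several reasonable ways to compute it.

```python
import json
from dataclasses import dataclass

# Hypothetical key names, used only for illustration.
REQUIRED_KEYS = ("enable_cool_new_thing", "downstream_url")


class ConfigError(Exception):
    """The config is unusable; let this propagate and crash the process."""


def load_config(raw_text: str) -> dict:
    """Parse JSON config text and require every key. No defaults anywhere."""
    try:
        config = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        raise ConfigError(f"config is not valid JSON: {exc}") from exc
    if not isinstance(config, dict):
        raise ConfigError("config must be a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ConfigError(f"config is missing required keys: {missing}")
    return config


@dataclass(frozen=True)
class RpcPolicy:
    timeout_ms: int
    max_retries: int


def strict_rpc_policy(latencies_ms: list[int], max_retries: int = 0) -> RpcPolicy:
    """Timeout of 3x the observed p99 (nearest-rank), with at most one retry."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    if max_retries not in (0, 1):
        raise ValueError("zero retries, or MAYBE one")
    ordered = sorted(latencies_ms)
    rank = -(-99 * len(ordered) // 100)  # ceil(0.99 * n) using integer math
    p99 = ordered[rank - 1]
    return RpcPolicy(timeout_ms=3 * p99, max_retries=max_retries)
```

The point of load_config is what it leaves out: there is no second argument carrying a default, so a missing key stops the service at startup instead of quietly picking an old code path.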

Merging (where we’re going, we don’t need tests)

  • Use Git. Use it for everything – infrastructure, configuration, code, dashboards, on-call rotations. Your git repository is your point-in-time-recoverable source of truth.

  • Don’t waste time on code coverage. Or on box-ticking exercises in general – they make for nice charts that have very little relationship with how much actual value your change is providing (see below).

  • Prioritize real-world validation. The highest-value-per-time-spent kind of test is just pushing your change to staging (or better, prod!) and showing it does what you wanted and doesn’t break everything. Second best is integration tests, with unit tests notably coming in last place – i.e. “only if you have some time”.

  • For infra changes, make plans extremely obvious. This could mean “post the Terraform plan as a comment on the pull request”, same with a helm diff. There are great tools to make sure the change you think you’re making is the change you’re actually making, so make sure they’re front and center.

  • For code changes, make regressions extremely obvious. Error logs, CPU usage, and request error rate are great signals, catching I’d guess 90% of bad versions and working for almost any service (totally generic). So don’t throw them away!
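
As a rough illustration of the last bullet, the three generic signals can be compared between the old and new version with a few lines of Python. This is only a sketch: VersionSignals, find_regressions, the 1.5x ratio, and the minimum deltas are all assumptions I picked for the example, and real thresholds would need tuning per service.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class VersionSignals:
    error_logs_per_minute: float
    cpu_utilization: float     # fraction of allocated CPU, 0.0 to 1.0
    request_error_rate: float  # fraction of failed requests, 0.0 to 1.0


# Assumed noise floors: increases smaller than this are ignored.
MIN_DELTAS = {
    "error_logs_per_minute": 5.0,
    "cpu_utilization": 0.05,
    "request_error_rate": 0.001,
}


def find_regressions(
    baseline: VersionSignals,
    candidate: VersionSignals,
    max_ratio: float = 1.5,
) -> list[str]:
    """Return one human-readable line per signal that got clearly worse."""
    regressions = []
    for field in fields(VersionSignals):
        old = getattr(baseline, field.name)
        new = getattr(candidate, field.name)
        # Flag only increases that are large both relatively and absolutely.
        if new > old * max_ratio and new - old > MIN_DELTAS[field.name]:
            regressions.append(f"{field.name}: {old} -> {new}")
    return regressions
```

Because nothing here is service-specific, the same check can sit in front of every deploy and print its output right where the person shipping the change will see it.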

Deploying (no sleep til prod)

  • Use Docker. It’s industry standard for a reason – wrangling dependencies in environments with tools like Chef or Ansible loses to these nice self-contained artifacts any day.

  • Deploy everything all the time. Every day that goes by without you deploying increases the chances that it’s actually secretly been broken (by someone’s change, a dependency update, a third-party API removal), and it’s very hard to track down what went wrong two weeks after the fact.

  • Validate deployments as they go. Can you build a completely busted image, deploy it, and have it successfully roll out to all machines? Why? This can be fixed a number of ways, including canary/shadow deployments or even just good readiness probes.

  • Enable limited “instant” config rollouts. It might sound counterintuitive (since an “instant” rollout often means “break everything all at once”) but the ability to disable a problematic feature flag or add an IP to a blocklist in under 5 minutes more than offsets the increased risk. It enables everyone to move fast, but must be managed carefully!
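
The “validate deployments as they go” idea boils down to a loop like the one below. It is a toy sketch rather than a real deploy tool: staged_rollout and its deploy, is_ready, and rollback callbacks are hypothetical stand-ins for whatever your platform provides (for example, a readiness probe result), and the doubling batch size is just one possible schedule.

```python
from typing import Callable, Sequence


def staged_rollout(
    instances: Sequence[str],
    deploy: Callable[[str], None],
    is_ready: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Deploy in growing batches, starting with a single canary.

    Returns True if every instance was updated and reported ready.
    On the first unready batch, rolls back everything updated so far
    and returns False.
    """
    updated: list[str] = []
    remaining = list(instances)
    batch_size = 1  # the first batch is the canary
    while remaining:
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        for instance in batch:
            deploy(instance)
            updated.append(instance)
        if not all(is_ready(instance) for instance in batch):
            for instance in reversed(updated):
                rollback(instance)
            return False
        batch_size *= 2
    return True
```

With a completely busted image, is_ready fails on the canary, so only one instance ever sees the bad version before it is rolled back.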

Operating (my god, it’s full of pods)

  • Use Kubernetes. Assuming you have more than one service and more than one instance, you either need or will need stuff like service discovery, autoscaling, and deployment versioning, and you do not want to manage all that by yourself. Kubernetes gives infra teams scalability superpowers.

  • Use Helm. Or some other tool for managing Kubernetes manifests, I’m not picky – the important thing is that you ~never directly use kubectl apply, edit, or delete. The resource lifecycle needs to be findable in version control.

  • Avoid operators and CRDs. As stated above, I like Kubernetes, but it has a steep learning curve for many developers, and custom operators veer hard into the “WTF is going on” territory that gives it its complicated reputation. Keep it simple.

  • Run 3 of everything. Like with backups, two is one and one is none. Additionally, ensure (like, actually verify, in production) that 2 of the 3 “things” can handle the full load by themselves – otherwise you don’t really have the failure tolerance you think you do.

  • Structured logs are non-negotiable. This plus injected trace IDs gets you 90% of the way to APM, but much cheaper and with much less work needed from developers.
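
For the structured logging bullet, here is roughly what “JSON logs plus an injected trace ID” can look like using only Python’s standard library. Treat it as a sketch under assumptions: the field names, the "fields" key passed through extra, and the build_logger helper are my own choices for the example, and in a real service the trace ID would usually be set by request middleware rather than by hand.

```python
import contextvars
import json
import logging

# Holds the trace ID for the current request or task.
trace_id_var = contextvars.ContextVar("trace_id", default="")


class TraceIdFilter(logging.Filter):
    """Copy the current trace ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = trace_id_var.get()
        return True


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", ""),
        }
        # Optional structured fields passed via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(name: str, stream) -> logging.Logger:
    """Return a logger that writes JSON lines with trace IDs to `stream`."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(TraceIdFilter())
    logger = logging.getLogger(name)
    logger.handlers.clear()  # avoid duplicate lines if called twice
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

After calling trace_id_var.set(...) at the start of a request, a line like logger.info("charged card", extra={"fields": {"amount_cents": 1299}}) comes out as one JSON object carrying that trace ID, so developers get correlation across services without adding anything to their own log calls.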

So that’s the current list! I think I’ll come back here periodically and add more. Please feel free to reach out if anything here particularly upsets you or there’s anything else you would like to talk to me about :P