Every conference talk I’ve sat through this year has someone breathlessly explaining SRE like it’s a revelation. SLOs! Error budgets! Toil reduction! The crowd nods along, takes notes, goes back to the office, and promptly cargo-cults the whole thing without stopping to think about whether any of it makes sense at their scale.
The principles behind SRE are genuinely good. I don’t dispute that. What drives me up the wall is watching teams of fifteen people set up elaborate error budget policies and multi-stage canary deployments when they have three services and a hundred users. You don’t need Google’s playbook. You need common sense with a bit of structure.
What SRE actually gets right
The core idea is dead simple: treat reliability as a feature, not a side effect. Measure it. Set targets. Make trade-offs explicit instead of pretending you can ship fast and never break anything.
At the fintech startup, I started by picking our most critical service – the real-time financial news API – and defining one SLO for it. Not twelve. One. “99.9% of requests succeed under 200ms.” That single number gave us a shared language between engineering and product. When product wanted to ship something risky, we could point at the error budget and have a real conversation instead of gut-feel arguments.
That’s the part people skip. The error budget isn’t a fancy dashboard widget. It’s a negotiation tool. Budget healthy? Ship bold. Budget burning? Slow down. Simple.
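The budget-as-negotiation idea reduces to simple arithmetic. A minimal sketch, with illustrative numbers (a 99.9% target over a million-request window — the window size is an assumption, not from the post):

```python
# Hypothetical sketch: turning an SLO into an error budget you can
# point at in a planning conversation. Window size is illustrative.

def error_budget(slo_target: float, window_requests: int) -> int:
    """Allowed failures over the window for a given SLO target."""
    return int(window_requests * (1 - slo_target))

def budget_remaining(slo_target: float, window_requests: int,
                     failures_so_far: int) -> float:
    """Fraction of the error budget still unspent (can go negative)."""
    budget = error_budget(slo_target, window_requests)
    return (budget - failures_so_far) / budget

# A 99.9% SLO over 1,000,000 requests allows 1,000 failed requests.
print(error_budget(0.999, 1_000_000))            # 1000
# 400 failures so far leaves 60% of the budget.
print(budget_remaining(0.999, 1_000_000, 400))   # 0.6
```

When the remaining fraction is healthy, that's the number you point at to justify shipping something risky; when it's near zero, it's the number that ends the argument.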
SLOs aren’t a religion
Here’s where the cargo-culting starts. I’ve seen teams spend weeks building SLO tracking infrastructure before they even know what their users care about. They’ll measure fourteen signals and set targets for all of them because the Google SRE book said so.
Pick the one or two things that matter to your users. For us it was latency and availability of the news feed. That’s it. We didn’t need saturation metrics for a service running on three nodes. We didn’t need a formal SLA document when our “SLA” was “don’t let the paying customers notice.”
Start with what hurts. The rest can wait.
Toil is real, but so is over-engineering
The toil concept is the best thing to come out of SRE thinking: manual, repetitive work that scales with the system and produces no lasting value. Kill it. Automate it. Eliminate the need for it entirely if you can.
At the fintech startup we had a deployment process that involved SSH-ing into machines and running scripts in a specific order. Classic toil. We automated it with a simple CI pipeline. Nothing fancy. No custom deployment orchestrator, no multi-stage canary with automatic rollback. A pipeline that ran tests and deployed if they passed.
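The whole pipeline fits in a dozen lines. A sketch of the "run tests, deploy if they passed" step, where `pytest` and `deploy.sh` are hypothetical stand-ins for whatever your project actually uses:

```python
# Minimal sketch of a test-then-deploy pipeline step. The commands
# ("pytest", "./deploy.sh") are placeholders for illustration.
import subprocess
import sys

def pipeline(run=subprocess.run) -> int:
    """Run the test suite; deploy only if it passes."""
    tests = run(["pytest", "-q"])
    if tests.returncode != 0:
        print("tests failed; not deploying")
        return tests.returncode
    deploy = run(["./deploy.sh"])
    return deploy.returncode

if __name__ == "__main__":
    sys.exit(pipeline())
```

That's the entire "orchestrator." The `run` parameter is only there so the logic can be exercised without real commands; in CI you'd let it default to `subprocess.run`.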
The temptation is always to build the Rolls-Royce version. Resist it. Automate the thing that’s wasting the most time right now. Move on. Come back later if you need more.
Monitoring: stop alerting on nonsense
I had an alert that fired when CPU hit 80%. It went off constantly. Nobody cared. It trained the whole team to ignore alerts, which is the single worst outcome a monitoring system can produce.
We ripped out the noise and replaced it with alerts tied to user impact. P99 latency above our SLO threshold for ten minutes? That pages someone. CPU is high but latency is fine? That’s a graph you check during business hours.
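The paging rule above is small enough to write down. A sketch, assuming one-minute p99 samples (the sample interval is my assumption; the 200ms threshold and ten-minute window come from the SLO and rule described above):

```python
# Sketch of "page only on sustained user impact": page when p99
# latency breaches the SLO threshold for ten consecutive one-minute
# samples. One-minute sampling is an assumption for illustration.

SLO_P99_MS = 200
WINDOW_MINUTES = 10

def should_page(p99_samples_ms: list) -> bool:
    """True only if the last WINDOW_MINUTES samples all breach the SLO."""
    if len(p99_samples_ms) < WINDOW_MINUTES:
        return False
    recent = p99_samples_ms[-WINDOW_MINUTES:]
    return all(s > SLO_P99_MS for s in recent)
```

A single bad minute never pages anyone; ten in a row does. High CPU with fine latency never enters the picture, which is exactly the point.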
Four golden signals – latency, traffic, errors, saturation – are a fine starting framework. But “framework” is the key word. You don’t need all four on day one. You need the ones that tell you when users are hurting.
Incidents and postmortems
This is the part most teams actually do need, regardless of scale. When something breaks, have a plan that isn’t “the person who knows the most panics the hardest.”
We kept it simple. Mitigate first, investigate later. Communicate during, not after. And when it’s over, write up what happened without blaming anyone. Blameless postmortems aren’t some warm-fuzzy HR initiative. They’re how you actually learn things, because people tell the truth when they’re not afraid of getting fired.
The part nobody talks about: follow through on the action items. I’ve seen more postmortem documents gather dust than I care to admit. If you’re not going to fix the thing, don’t bother writing it down.
Progressive delivery is just good sense
Deploy to 1%, watch the metrics, expand to 10%, watch again, go to 100%. This isn’t rocket science. It’s the engineering equivalent of “taste the soup before serving it.” But you don’t always need automated canary analysis and a dedicated deployment pipeline to do it. Sometimes a feature flag and a pair of eyes on a dashboard is plenty.
Rollback capability matters more than rollout sophistication. Can you get back to the last known good state quickly? If yes, you’re in decent shape. If no, fix that before you build anything else.
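"Last known good" doesn't need tooling either — it needs a record. A sketch of the idea, with storage and deploy mechanics left as placeholders:

```python
# Sketch of tracking "last known good" so rollback is just
# redeploying the previous good version. In-memory list stands in
# for whatever durable record (tag, registry, file) you'd really use.

class DeployHistory:
    def __init__(self):
        self._good = []  # versions confirmed healthy, in order

    def mark_good(self, version: str) -> None:
        """Record a version once it has been verified in production."""
        self._good.append(version)

    def rollback_target(self):
        """The version to redeploy: the good one before the current one."""
        return self._good[-2] if len(self._good) >= 2 else None
```

If answering "what do we roll back to, and how fast?" takes longer than a sentence, that's the gap to close before building anything fancier.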
The point
SRE gives you a vocabulary and a set of principles for thinking about reliability. That’s valuable. What’s not valuable is treating the Google SRE book like scripture and implementing every practice wholesale because a conference speaker said you should.
Figure out what’s actually breaking. Measure the thing that matters. Automate the task that’s eating your weekends. Write down what went wrong when it goes wrong. That’s it. That’s the whole thing, applied honestly, without the ceremony.