Quick take
If burning through your error budget doesn’t change how your team ships, you don’t have SLOs. You have decorative charts.
I’ve watched three different teams adopt SLOs in the past year. Two of them ended up with beautiful Grafana dashboards that nobody looked at after the first sprint. The third team actually used their error budget to cancel a feature release and fix a checkout regression instead. Guess which team had fewer incidents in Q4.
The difference wasn’t tooling. It was whether the SLO changed behavior or just measured things.
SLOs are decisions, not dashboards
At the fintech startup, we tracked uptime for our financial data API. 99.9% availability on a fancy status page. Looked great. The problem? Our SLI was measuring HTTP 200s from a health endpoint. Meanwhile, users were getting stale stock prices because our data pipeline was silently lagging by 30 minutes. By our SLO, everything was fine. By our users’ experience, the product was broken.
An effective SLO is a contract between reliability and velocity. It answers one question: can we ship this week, or do we owe the users some stability work first? If it doesn’t influence your sprint planning, kill it.
Measure what your users feel, not what your infra reports
Start with the user journey. Not the Kubernetes dashboard.
These aren’t user-facing indicators:
- CPU utilization
- Pod restart counts
- Database connection pool size
These are:
- Successful checkout completions
- Search results returned in under 400ms
- API responses with correct, fresh data
The distinction seems obvious written down, but I still see teams default to infrastructure metrics because they’re easier to collect. Easier isn’t the point. Accurate is.
Start with four signals, then get specific
For most services, you can begin with availability, latency, throughput, and error rate. The classic golden signals. But don’t stop there. Availability that counts health check pings the same as checkout requests is lying to you.
A useful SLI is brutally specific:
- The request: POST /api/v1/checkout from authenticated users
- What counts as success: HTTP status < 500 AND order confirmation generated
- The population: production traffic only, excluding synthetic monitors
- The window: rolling 28 days
That specificity is the difference between an SLI that catches real problems and one that hides them in averages.
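A definition that specific translates almost directly into code. As a minimal sketch, assuming you have structured request logs (the `Request` fields and `checkout_sli` name here are hypothetical, not from any particular library):

```python
from dataclasses import dataclass

@dataclass
class Request:
    route: str
    method: str
    authenticated: bool
    synthetic: bool          # True for synthetic monitors
    status_code: int
    order_confirmed: bool

def checkout_sli(requests):
    """Success rate for the checkout SLI defined above.

    Population: authenticated production traffic to POST /api/v1/checkout,
    excluding synthetic monitors.
    Success: HTTP status < 500 AND order confirmation generated.
    """
    population = [
        r for r in requests
        if r.route == "/api/v1/checkout"
        and r.method == "POST"
        and r.authenticated
        and not r.synthetic
    ]
    if not population:
        return None  # no traffic is not the same thing as 100% success
    good = sum(
        1 for r in population
        if r.status_code < 500 and r.order_confirmed
    )
    return good / len(population)
```

Note how every clause of the definition shows up as an explicit filter. If a filter isn't written down, it isn't part of the SLI, and averages will hide exactly the failures you care about.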
Set targets from data, not ambition
I see this constantly: a team picks 99.99% availability because it sounds professional. They’ve been running at 99.2% for six months. The gap between target and reality is so large that the error budget is permanently exhausted, which means the policy attached to it is permanently triggered, which means everyone ignores it.
A target that’s never met is noise. A target that’s always met is invisible. Neither changes behavior.
Here’s what actually works:
- Measure your current performance for 2-4 weeks. No changes, just observation.
- Set the target slightly tighter than current reality. If you’re running at 99.5%, try 99.7%.
- Adjust quarterly based on data and user feedback. Not based on what the VP saw at a conference.
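One way to turn that observation window into a concrete number, as a rough sketch (the 0.4 tightening factor and the "next nine" ceiling are illustrative choices, not doctrine):

```python
def suggest_target(daily_success_ratios):
    """Propose an SLO target slightly tighter than current reality.

    Takes daily success ratios from a 2-4 week observation window and
    nudges the baseline part-way toward the next 'nine', never past it.
    """
    baseline = min(daily_success_ratios)   # worst observed day sets the floor
    next_nine = 1 - (1 - baseline) / 10    # e.g. 99.5% -> 99.95%
    return baseline + 0.4 * (next_nine - baseline)
```

Running at a worst day of 99.5%, this suggests roughly 99.68% — close to the "try 99.7%" rule of thumb above, and far from the aspirational 99.99% that nobody will ever meet.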
Different services deserve different targets. Your payment processing endpoint can justify 99.95%. Your internal admin dashboard? 99% is probably generous. During the early Decloud days at EF, we ran our dev tooling at targets that would horrify a payments team – and that was the right call. We needed to ship fast, not polish internal tools.
Windows matter more than you think
A 99.9% SLO over 30 days gives you about 43 minutes of allowed downtime. The same target over 7 days gives you about 10 minutes. Choose a window that matches how fast your team can actually detect and respond to problems. If your mean time to detect is 20 minutes, a 7-day window at 99.9% is a trap.
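The window math is just arithmetic, and it's worth making explicit before you commit to a window:

```python
def allowed_downtime_minutes(slo, window_days):
    """Downtime budget implied by an availability SLO over a window."""
    return (1 - slo) * window_days * 24 * 60
```

99.9% over 30 days allows about 43 minutes; the same target over 7 days allows about 10. If detection alone takes 20 minutes, a single incident consumes most of the shorter window's budget before anyone has even responded.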
Error budgets: the part everyone gets wrong
The math is simple:
error_budget = 1 - SLO_target
A 99.9% target over 30 days means you can tolerate roughly 43 minutes of downtime. A 99.5% target gives you about 3.6 hours. These numbers aren’t interesting by themselves. What makes them powerful is the policy.
Budget healthy (> 50% remaining): Ship normally. Take calculated risks. Run that migration you’ve been planning.
Budget tight (10-50% remaining): Slow down releases. Require extra review on risky changes. Maybe skip the experimental feature flag rollout this week.
Budget burned (< 10% remaining): Stop feature work. The entire team focuses on reliability until the budget recovers.
That third state is where most teams fail. They write the policy, then when the budget actually burns, some product manager argues that the feature is too important to delay. If leadership won’t enforce the budget policy, you don’t have SLOs. You have aspirations.
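The three budget states above reduce to a few lines; the thresholds mirror the 50% and 10% cut-offs in the text, and the point is that the mapping should be mechanical, not negotiable:

```python
def release_policy(budget_remaining):
    """Map remaining error budget (as a fraction) to a shipping policy.

    > 50% remaining: ship normally.
    10-50% remaining: slow down, extra review on risky changes.
    < 10% remaining: stop feature work, focus on reliability.
    """
    if budget_remaining > 0.5:
        return "ship-normally"
    if budget_remaining >= 0.1:
        return "slow-down"
    return "feature-freeze"
```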
Track burn rate, not just remaining budget
A single bad deployment can eat your monthly budget in an hour. By the time you notice the remaining budget is low, the damage is done.
burn_rate = errors_in_window / budget_for_window
Alert on burn rate. If you’re consuming budget at 10x the sustainable rate, you want to know in minutes, not at the Monday standup.
Keep it minimal
You don’t need an SLO for every endpoint. Pick the 3-5 user journeys that define whether your product is working. For most B2B SaaS, that’s: login, core workflow, data export, and billing. Everything else is noise at this stage.
Instrumentation comes first. An SLO is just a query on top of good metrics. If you don’t have request counts, status codes, and latency histograms, start there. A YAML definition can be as simple as:
slo:
  name: checkout-availability
  objective: 99.9
  window: 28d
  indicator: success_rate
  filter: "route = /checkout AND source = production"
  success: "status_code < 500 AND order_confirmed = true"
Your dashboard should answer three questions and nothing else: Are we meeting the SLO right now? How much budget is left? How fast are we burning it?
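Those three questions can be answered from two counters plus the window, assuming an event-based SLO where events are counted over the elapsed part of the window (a sketch, not a production implementation):

```python
def slo_snapshot(good_events, total_events, slo, window_days, elapsed_days):
    """The three dashboard numbers: meeting the SLO, budget left, burn rate."""
    compliance = good_events / total_events
    budget_used = (1 - compliance) / (1 - slo)  # fraction of budget consumed
    return {
        "meeting_slo": compliance >= slo,
        "budget_remaining": 1 - budget_used,
        "burn_rate": budget_used / (elapsed_days / window_days),
    }
```

A burn rate of 1.0 means you'll land exactly on budget at the end of the window; anything sustained above that means the budget runs out early.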
Proof it works
Here’s how you know if your SLOs are working: the last time your error budget got tight, did anything actually change? Did a release get delayed? Did someone shift from feature work to fixing that flaky dependency? Did the on-call rotation get extra support?
If the answer is no, go back to the error budget policy and make it real. Get sign-off from engineering leadership. Write it into your sprint process. Make the consequences automatic, not optional.
SLOs are a decision framework disguised as monitoring. The monitoring part is easy. The decision part is where most teams give up.
Don’t be most teams.