I have a test I run when I walk into a new organization. I ask the SRE team lead: “When was the last time you gave a service back to a product team because they met their reliability bar?” If the answer is never, they don’t have an SRE team. They have an ops team with a trendy name.
SRE is fundamentally about incentives and ownership. Not tooling. Not Prometheus dashboards. The org structure you choose determines whether reliability work gets prioritized or gets dumped on a small group of people who burn out.
The Three Models That Actually Exist
Centralized SRE is one team supporting many services. It works early on when you’re establishing standards and building your first incident response muscle. I recommend it for companies under 200 engineers with shared infrastructure.
The failure mode is predictable. The SRE team becomes a ticket queue. Product teams throw things over the wall. SREs lose context on individual services. On-call burden grows faster than headcount. I’ve seen centralized SRE teams where every member was on call every other week. That isn’t sustainable. That’s a resignation factory.
Embedded SRE puts reliability engineers inside product teams. Strong domain knowledge. Clear ownership. I like this model for complex services where the operational profile is unique: think payment processing, real-time data pipelines, anything where the SRE needs to understand the business logic to debug production.
The failure mode here is fragmentation. Every team invents their own monitoring stack. Incident learnings don’t spread. SREs get treated as “the person who does the ops stuff” instead of a partner. And career growth gets weird because the SRE reports to a product manager who doesn’t understand their work.
Platform/Enablement SRE is the model I push most organizations toward, eventually. A central team builds reliability tooling, templates, and automation. Product teams own their own operations. The SRE team makes reliability easy instead of doing reliability for everyone.
This only works if the platform is good. If your self-service tooling is bad, product teams can’t actually own their operations and everything escalates back to the same five experts. I’ve watched this fail spectacularly at companies that tried to jump straight here without building the self-service layer first.
What I Actually Recommend
Most organizations over 300 engineers should run a hybrid. Central platform work for the shared layer. Embedded support for the three to five most critical services. Everything else gets the self-service platform and advisory engagement.
The key is having explicit engagement levels. Not every service deserves dedicated SRE attention. I use three tiers:
- Advisory: Architecture reviews and SLO guidance. Minimal operational involvement. This is for most services.
- Partner: Joint SLOs, shared incident response, regular reliability reviews. This is for important services with active reliability investment.
- Full support: Dedicated SRE ownership of production and on-call. This is for your revenue-critical, customer-facing, “if this breaks the CEO calls someone” services.
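The tier assignment above is simple enough to encode directly in a service catalog. Here's a minimal sketch in Python; the class names, fields, and decision rules are illustrative assumptions, not from any real catalog tool:

```python
from dataclasses import dataclass
from enum import Enum

class Engagement(Enum):
    ADVISORY = "advisory"        # architecture reviews, SLO guidance
    PARTNER = "partner"          # joint SLOs, shared incident response
    FULL_SUPPORT = "full"        # dedicated SRE ownership and on-call

@dataclass
class Service:
    name: str
    revenue_critical: bool
    customer_facing: bool
    team_meets_baseline: bool    # SLOs, runbooks, automated deploys, etc.

def engagement_for(svc: Service) -> Engagement:
    # Revenue-critical, customer-facing services get dedicated SRE ownership.
    if svc.revenue_critical and svc.customer_facing:
        return Engagement.FULL_SUPPORT
    # Important services with a team that meets the baseline become partners.
    if svc.team_meets_baseline and (svc.revenue_critical or svc.customer_facing):
        return Engagement.PARTNER
    # Everything else: self-service platform plus advisory engagement.
    return Engagement.ADVISORY
```

The point of writing it down as rules rather than negotiating case by case is that the tier becomes an explicit, reviewable decision instead of whoever lobbies hardest.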
Scale engagement up as criticality increases. Scale it back as the product team matures. The goal is always to make the SRE team less needed, not more.
The Entry Criteria Nobody Wants to Set
Here is where I get unpopular. I tell organizations to require teams to meet a baseline before they get dedicated SRE support. Defined SLOs. Monitoring that reflects user impact. Up-to-date runbooks. Automated deployments with rollback. Participation in postmortems.
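A baseline like this works best as a checklist that produces a concrete gap list, not a yes/no judgment call. A minimal sketch, with field names that are my own shorthand for the criteria above:

```python
from dataclasses import dataclass

@dataclass
class ReadinessBaseline:
    # One field per entry criterion; names are illustrative.
    has_slos: bool
    monitoring_reflects_user_impact: bool
    runbooks_current: bool
    automated_deploys_with_rollback: bool
    attends_postmortems: bool

    def missing(self) -> list[str]:
        """Return the criteria this team still has to meet."""
        return [name for name, met in vars(self).items() if not met]

team = ReadinessBaseline(True, True, False, True, True)
if team.missing():
    print("Not ready for a dedicated SRE partner. Missing:", team.missing())
```

Handing a team its gap list turns "you're not ready" into a roadmap rather than a rejection.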
If a product team can’t be bothered to define an SLO, they aren’t ready for an SRE partner. They want someone to carry their pager. That’s a different thing.
Sizing
The “1 SRE per 10 developers” ratio gets thrown around a lot. It’s a fine starting point but it means nothing without context. A team of 10 developers running a single stateless API needs less SRE support than a team of 10 running a distributed database cluster.
What matters more: can you sustain your on-call rotations? You need at least 5-6 people per rotation to avoid burnout. If you can’t staff that, you either have too many services or too few SREs. Probably both.
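The rotation math is worth making explicit, because it's what connects headcount to burnout. Assuming a weekly primary handoff:

```python
def oncall_weeks_per_year(rotation_size: int) -> float:
    """With a weekly primary handoff, each member carries the
    pager 52 / rotation_size weeks per year."""
    return 52 / rotation_size

# Two people alternating: the "every other week" resignation factory.
print(oncall_weeks_per_year(2))   # 26 weeks a year on call
# The 5-6 person floor keeps it to roughly two months a year or less.
print(oncall_weeks_per_year(6))
```

The same arithmetic explains why merging two thin rotations into one sustainable one is often the right call, even when the services seem unrelated.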
The Anti-Patterns I See Everywhere
Renaming ops to SRE. Same people, same work, new title. Nobody adopts error budgets. Nobody does blameless postmortems. Nobody writes automation to replace toil. Just a title change and a conference talk about "our SRE journey."
SRE as deployment gatekeeper. If every deploy needs SRE approval, you’ve built a bottleneck and called it reliability. Deployments should be automated and safe by default. SRE shouldn’t be in the approval chain for routine changes.
SRE as permanent firefighter. If your SRE team spends 80% of their time on incident response and 20% on prevention, the math never gets better. Google's original guidance was a 50% cap on operational work. I've never seen an enterprise hit exactly 50%, but the principle is right. If there's no time for prevention, you're just managing decline.
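The check itself is trivial, which is exactly why it should be tracked rather than estimated. A sketch, using hypothetical time-tracking numbers:

```python
def ops_fraction(incident_hours: float, project_hours: float) -> float:
    """Share of team time spent on reactive operational work,
    as opposed to prevention and platform projects."""
    return incident_hours / (incident_hours + project_hours)

OPS_CAP = 0.5  # the 50% cap from Google's original SRE guidance

# The 80/20 firefighting split from the text:
print(ops_fraction(80, 20))               # 0.8
print(ops_fraction(80, 20) > OPS_CAP)     # True: over the cap, managing decline
```

Teams that measure this honestly, sprint over sprint, notice the slide into firefighting long before it becomes the team's identity.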
The best SRE team I’ve worked with was at a mid-sized fintech. Six people. They supported the whole company not by owning every service, but by building such good tooling and templates that product teams could own their own reliability. The SRE team’s backlog was mostly platform improvements. Incidents were handled by the teams that owned the services. SREs showed up for the big ones and ran the postmortems.
That’s what good looks like. SRE as a leverage point, not a fire department.