// Topic
SRE
Definition
SRE coverage in this archive spans 8 posts from Oct 2017 to Nov 2021 and focuses on reliability, delivery speed, and cost discipline as one system, not three separate concerns. The strongest adjacent threads are reliability, devops, and incident management. Recurring title motifs include incident, sre, engineering, and outage.
What the archive argues
- Most posts prioritize predictable operations over feature breadth or stack novelty.
- Early posts lean on incidents and process, while newer posts lean on observability-driven development as constraints shifted.
- This topic repeatedly intersects with reliability, devops, and incident management, so design choices here rarely stand alone.
Execution checklist
- Set SLOs first, then choose tooling that keeps deploy, observability, and rollback simple.
- Start with the newest post to calibrate current constraints, then backtrack to older entries for first principles.
- When boundary questions appear, cross-read reliability and devops before committing implementation details.
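The first checklist item, setting SLOs before choosing tooling, comes down to a small amount of arithmetic. What follows is a minimal sketch in Python, assuming a request-based availability SLO. The names `ErrorBudget` and `can_ship_risky_change` and the 25% threshold are illustrative assumptions, not taken from the posts.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical request-based error budget for one SLO window."""

    slo_target: float     # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int   # requests observed in the window
    failed_requests: int  # requests that violated the indicator

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def budget_remaining(self) -> float:
        # Fraction of the budget still unspent; negative once overspent.
        allowed = self.allowed_failures
        if allowed == 0:
            return 0.0 if self.failed_requests else 1.0
        return 1.0 - self.failed_requests / allowed


def can_ship_risky_change(budget: ErrorBudget, min_remaining: float = 0.25) -> bool:
    # Assumed policy: only ship risky changes while at least
    # `min_remaining` of the error budget is left.
    return budget.budget_remaining >= min_remaining


if __name__ == "__main__":
    window = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
    print(f"allowed failures: {window.allowed_failures:.0f}")
    print(f"budget remaining: {window.budget_remaining:.0%}")
    print(f"ship risky change: {can_ship_risky_change(window)}")
```

The point of the sketch is that the SLO produces a number the team can act on, which is what makes the later tooling choices (deploy, observability, rollback) measurable rather than a matter of taste.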
Common failure modes
- Adding platform layers faster than the team can operate and debug them.
- Chasing throughput gains without proving they improve end-user reliability.
- Applying guidance from the 2017 to 2021 posts without revisiting its assumptions as context has changed.
Suggested reading path
- Start here (current state): What a 3 AM Outage Taught Me About Incident Management
- Then read (operating middle): Most Chaos Engineering Is Theater
- Finish with (foundational context): Your Incident Process Will Break at 15 People. Here’s What to Do.
Related posts
- What a 3 AM Outage Taught Me About Incident Management
- Stop Renaming Your Ops Team to SRE
- Database Reliability Engineering: What I’ve Learned the Hard Way
- Observability-Driven Development Is Just Instrumenting Your Code
- Most Chaos Engineering Is Theater
- Your SLOs Are Probably Useless (Here’s How to Fix Them)
- SRE Principles Are Great. The Cargo-Culting Is Not.
- Your Incident Process Will Break at 15 People. Here’s What to Do.
References
8 posts
What a 3 AM Outage Taught Me About Incident Management
Good incident response is not about preventing failure. It is about failing well. Lessons from a decade of on-call, including NATO and telecom-scale operations.
Stop Renaming Your Ops Team to SRE
Opinionated take on SRE team models from someone who has seen them all fail in interesting ways.
Database Reliability Engineering: What I've Learned the Hard Way
Practical database reliability from running Postgres at a fintech startup and at large enterprises. Includes config examples, migration patterns, and the operational habits that actually prevent outages.
Observability-Driven Development Is Just Instrumenting Your Code
ODD sounds fancy. It's not. It means writing logs, metrics, and traces before you ship, not after your first outage.
Most Chaos Engineering Is Theater
Teams love saying they do chaos engineering. Few actually have hypotheses. Even fewer fix what they find.
Your SLOs Are Probably Useless (Here's How to Fix Them)
Most SLOs are dashboards nobody acts on. Here's how to pick indicators that reflect real users, set targets grounded in data, and make error budgets actually change how your team ships.
SRE Principles Are Great. The Cargo-Culting Is Not.
The SRE hype train has everyone copying Google's playbook without asking whether it fits. Here's what actually matters when you're not running planet-scale infrastructure.
Your Incident Process Will Break at 15 People. Here's What to Do.
What I learned building incident management at a fintech startup, from five people shouting across a room to actual structured response.