SRE

Definition

SRE coverage in this archive spans 8 posts from Oct 2017 to Nov 2021 and focuses on reliability, delivery speed, and cost discipline as one system, not three separate concerns. The strongest adjacent threads are reliability, devops, and incident management. Recurring title motifs include incident, sre, engineering, and outage.

What the archive argues

Most posts prioritize predictable operations over feature breadth or stack novelty.
Early posts lean on incident process, while newer posts lean on observability-driven development, reflecting how constraints shifted over time.
This topic repeatedly intersects with reliability, devops, and incident management, so design choices here rarely stand alone.

Execution checklist

Set SLOs first, then choose tooling that keeps deploy, observability, and rollback simple.
Start with the newest post to calibrate current constraints, then backtrack to older entries for first principles.
When boundary questions appear, cross-read the reliability and devops topics before committing to implementation details.
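
To make "set SLOs first" concrete, the sketch below turns an SLO target into an error budget and a simple ship/freeze decision. It is a minimal illustration under stated assumptions, not code from any of the posts: the names ErrorBudget and can_ship, the request-based SLI, and the freeze_below threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Request-based error budget for one SLO window (illustrative only)."""

    slo_target: float     # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int   # requests observed in the SLO window
    failed_requests: int  # requests that violated the SLI in that window

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def budget_remaining(self) -> float:
        """Fraction of the budget still unspent; negative means overspent."""
        allowed = self.allowed_failures
        if allowed == 0:
            # A 100% target (or an empty window) leaves no room for failure.
            return 0.0 if self.failed_requests == 0 else float("-inf")
        return 1.0 - self.failed_requests / allowed


def can_ship(budget: ErrorBudget, freeze_below: float = 0.0) -> bool:
    """Hypothetical policy: keep shipping while budget remains above a floor."""
    return budget.budget_remaining > freeze_below


if __name__ == "__main__":
    healthy = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    print(f"remaining: {healthy.budget_remaining:.2f}, ship: {can_ship(healthy)}")
```

With a 99.9% target over one million requests, roughly 1,000 failures are allowed, so 250 failures leave about three quarters of the budget. The point of the policy function is the one the SLO post makes: a budget only matters if it changes how the team ships.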

Common failure modes

Adding platform layers faster than the team can operate and debug them.
Chasing throughput gains without proving they improve end-user reliability.
Applying guidance from posts written between 2017 and 2021 without revisiting the assumptions behind it as context changed.

Suggested reading path

Start here (current state): What a 3 AM Outage Taught Me About Incident Management
Then read (operating middle): Most Chaos Engineering Is Theater
Finish with (foundational context): Your Incident Process Will Break at 15 People. Here’s What to Do.

References

8 entries tagged “SRE”

What a 3 AM Outage Taught Me About Incident Management
November 29, 2021 · 6 min
Good incident response is not about preventing failure. It is about failing well. Lessons from a decade of on-call, including national cyber-defense and telecom-scale operations.
incident-management sre on-call

Stop Renaming Your Ops Team to SRE
November 8, 2021 · 5 min
Opinionated take on SRE team models from someone who has seen them all fail in interesting ways.
sre teams organization

Database Reliability Engineering: What I've Learned the Hard Way
August 9, 2021 · 7 min
Practical database reliability from running Postgres in production: configs, safe migration patterns, and the operational habits that prevent outages.
databases reliability sre

Observability-Driven Development Is Just Instrumenting Your Code
June 14, 2021 · 4 min
ODD sounds fancy. It's not. It means writing logs, metrics, and traces before you ship, not after your first outage.
observability monitoring development

Most Chaos Engineering Is Theater
June 8, 2020 · 3 min
Teams love saying they do chaos engineering. Few actually have hypotheses. Even fewer fix what they find.
chaos-engineering reliability sre

Your SLOs Are Probably Useless (Here's How to Fix Them)
May 20, 2019 · 6 min
Most SLOs are dashboards nobody acts on. Pick indicators that reflect real users, set targets from data, and make error budgets change how your team ships.
sre slo reliability

SRE Principles Are Great. The Cargo-Culting Is Not.
April 30, 2018 · 5 min
The SRE hype train has everyone copying Google's playbook without asking whether it fits. What actually matters when you're not running at planet scale.
sre devops reliability

Your Incident Process Will Break at 15 People. Here's What to Do.
October 23, 2017 · 5 min
What I learned building incident management at the fintech startup, from five people shouting across a room to actual structured response.
incident-management devops on-call