What a 3 AM Outage Taught Me About Incident Management
Good incident response is not about preventing failure. It is about failing well. Lessons from a decade of on-call, including NATO and telecom-scale operations.
SRE coverage in this archive spans 8 posts from Oct 2017 to Nov 2021 and focuses on reliability, delivery speed, and cost discipline as one system, not three separate concerns. The strongest adjacent threads are reliability, devops, and incident management. Recurring title motifs include incident, sre, engineering, and outage.
Opinionated take on SRE team models from someone who has seen them all fail in interesting ways.
Practical database reliability from running Postgres at a fintech startup and at large enterprises. Includes config examples, migration patterns, and the operational habits that actually prevent outages.
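As a taste of the migration patterns that post covers, here is a minimal sketch of one common Postgres habit, a batched backfill of a newly added nullable column, so no single transaction holds locks for long. The `orders` table, `status_v2` column, batch size, and connection string are all hypothetical, not taken from the post.

```python
# Hypothetical batched backfill: fill a newly added nullable column in small
# chunks. Assumes a prior, instant migration added `status_v2` as nullable.
import time
import psycopg2

BATCH = 5_000

conn = psycopg2.connect("dbname=app")  # connection details are placeholders

while True:
    with conn.cursor() as cur:
        # Update one bounded batch; SKIP LOCKED avoids fighting live traffic.
        cur.execute(
            """
            UPDATE orders
               SET status_v2 = UPPER(status)
             WHERE id IN (
                   SELECT id FROM orders
                    WHERE status_v2 IS NULL
                    LIMIT %s
                   FOR UPDATE SKIP LOCKED)
            """,
            (BATCH,),
        )
        updated = cur.rowcount
    conn.commit()  # release locks between batches
    if updated == 0:
        break
    time.sleep(0.1)  # give replication and foreground queries room to breathe

conn.close()
```

Only after the backfill completes would you add the NOT NULL constraint, keeping every individual step fast and reversible.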
Observability-driven development (ODD) sounds fancy. It isn't. It means instrumenting with logs, metrics, and traces before you ship, not after your first outage.
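A minimal sketch of what instrumenting before you ship can look like, using only the Python standard library; the `checkout` handler, `emit_metric` helper, and span-id scheme are illustrative stand-ins for a real metrics and tracing stack, not anything prescribed by the post.

```python
# The handler emits a structured log, a latency metric, and a trace-style
# span id on day one, not after the first outage. All names are illustrative.
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("checkout")

def emit_metric(name: str, value: float, tags: dict) -> None:
    # Stand-in for a real metrics client (StatsD, Prometheus, etc.).
    log.info(json.dumps({"metric": name, "value": value, **tags}))

def checkout(order_id: str) -> None:
    span_id = uuid.uuid4().hex[:16]  # poor man's trace span
    start = time.perf_counter()
    log.info(json.dumps({"event": "checkout.start",
                         "order_id": order_id, "span_id": span_id}))
    try:
        ...  # business logic goes here
        log.info(json.dumps({"event": "checkout.ok",
                             "order_id": order_id, "span_id": span_id}))
    finally:
        emit_metric("checkout.latency_ms",
                    (time.perf_counter() - start) * 1000,
                    {"span_id": span_id})

checkout("ord-123")
```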
Teams love saying they do chaos engineering. Few actually have hypotheses. Even fewer fix what they find.
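For concreteness, a sketch of what a hypothesis-first chaos experiment can look like: it refuses to run without a measurable steady-state check and always rolls back. All names and the toy replica counter are illustrative, not from the post.

```python
# A chaos experiment that states its hypothesis and steady-state check up
# front, and reports loudly when the hypothesis is falsified.
from dataclasses import dataclass
from typing import Callable

@dataclass
class ChaosExperiment:
    hypothesis: str                    # e.g. "p99 stays under 300ms if a replica dies"
    steady_state: Callable[[], bool]   # measurable check, before and after
    inject_fault: Callable[[], None]   # the actual disruption
    rollback: Callable[[], None]       # always runs, even on failure

    def run(self) -> bool:
        assert self.steady_state(), "system not healthy; refusing to start"
        try:
            self.inject_fault()
            holds = self.steady_state()
        finally:
            self.rollback()
        print(f"{'PASS' if holds else 'FALSIFIED'}: {self.hypothesis}")
        return holds

# Toy usage: the "fault" decrements a replica count, the check notices.
state = {"replicas": 3}
exp = ChaosExperiment(
    hypothesis="service stays available with one replica down",
    steady_state=lambda: state["replicas"] >= 2,
    inject_fault=lambda: state.update(replicas=state["replicas"] - 1),
    rollback=lambda: state.update(replicas=3),
)
exp.run()
```

The point is the structure, not the toy fault: if you can't write the `steady_state` function, you don't have a hypothesis yet.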
Most SLOs are dashboards nobody acts on. Here's how to pick indicators that reflect real users, set targets grounded in data, and make error budgets actually change how your team ships.
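The arithmetic that makes an error budget actionable fits in a few lines. The 99.9% target, window, and traffic numbers below are illustrative, not from the post.

```python
# Back-of-envelope error-budget math: a target only changes how you ship
# if you can say how much failure it allows and how fast you're spending it.
WINDOW_DAYS = 28
SLO_TARGET = 0.999            # 99.9% of requests succeed
REQUESTS_PER_DAY = 2_000_000

budget_fraction = 1 - SLO_TARGET
budget_requests = budget_fraction * REQUESTS_PER_DAY * WINDOW_DAYS
budget_minutes = budget_fraction * WINDOW_DAYS * 24 * 60  # if fully down

print(f"Error budget over {WINDOW_DAYS} days: "
      f"{budget_requests:,.0f} failed requests, "
      f"or ~{budget_minutes:.0f} minutes of full downtime.")

# Burn rate: a rate of 1.0 exhausts the budget exactly at the end of the
# window, so paging alerts typically fire well above 1.0.
observed_error_rate = 0.004   # 0.4% of requests failing right now
burn_rate = observed_error_rate / budget_fraction
print(f"Current burn rate: {burn_rate:.1f}x "
      f"(budget gone in {WINDOW_DAYS / burn_rate:.1f} days at this pace)")
```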
The SRE hype train has everyone copying Google's playbook without asking whether it fits. Here's what actually matters when you're not running planet-scale infrastructure.
What I learned building incident management at a fintech startup: from five people shouting across a room to actual structured response.