What a 3 AM Outage Taught Me About Incident Management
Good incident response is not about preventing failure. It is about failing well. Lessons from a decade of on-call, including NATO and telecom-scale operations.
SRE coverage in this archive spans 8 posts from Oct 2017 to Nov 2021 and focuses on reliability, delivery speed, and cost discipline as one system, not three separate concerns. The strongest adjacent threads are reliability, devops, and incident management. Recurring title motifs include incident, sre, engineering, and outage.
Opinionated take on SRE team models from someone who has seen them all fail in interesting ways.
Practical database reliability from running Postgres at a fintech startup and at large enterprises. Includes config examples, migration patterns, and the operational habits that actually prevent outages.
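As a taste of the migration patterns that post covers, here is a minimal sketch of one common Postgres habit, a batched backfill of a newly added nullable column, so no single transaction holds locks for long. The `orders` table, `status_v2` column, batch size, and connection string are all hypothetical, not taken from the post.

```python
# Hypothetical batched backfill: fill a newly added nullable column in small
# chunks. Assumes a prior, instant migration added `status_v2` as nullable.
import time
import psycopg2

BATCH = 5_000

conn = psycopg2.connect("dbname=app")  # connection details are placeholders

while True:
    with conn.cursor() as cur:
        # Update one bounded batch; SKIP LOCKED avoids fighting live traffic.
        cur.execute(
            """
            UPDATE orders
               SET status_v2 = UPPER(status)
             WHERE id IN (
                   SELECT id FROM orders
                    WHERE status_v2 IS NULL
                    LIMIT %s
                   FOR UPDATE SKIP LOCKED)
            """,
            (BATCH,),
        )
        updated = cur.rowcount
    conn.commit()  # release locks between batches
    if updated == 0:
        break
    time.sleep(0.1)  # give replication and foreground queries room to breathe

conn.close()
```

Only after the backfill completes would you add the NOT NULL constraint, keeping every individual step fast and reversible.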
Observability-driven development (ODD) sounds fancy. It isn't. It means instrumenting with logs, metrics, and traces before you ship, not after your first outage.
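A minimal sketch of what instrumenting before you ship can look like, using only the Python standard library; the `checkout` handler, `emit_metric` helper, and span-id scheme are illustrative stand-ins for a real metrics and tracing stack, not anything prescribed by the post.

```python
# The handler emits a structured log, a latency metric, and a trace-style
# span id on day one, not after the first outage. All names are illustrative.
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("checkout")

def emit_metric(name: str, value: float, tags: dict) -> None:
    # Stand-in for a real metrics client (StatsD, Prometheus, etc.).
    log.info(json.dumps({"metric": name, "value": value, **tags}))

def checkout(order_id: str) -> None:
    span_id = uuid.uuid4().hex[:16]  # poor man's trace span
    start = time.perf_counter()
    log.info(json.dumps({"event": "checkout.start",
                         "order_id": order_id, "span_id": span_id}))
    try:
        ...  # business logic goes here
        log.info(json.dumps({"event": "checkout.ok",
                             "order_id": order_id, "span_id": span_id}))
    finally:
        emit_metric("checkout.latency_ms",
                    (time.perf_counter() - start) * 1000,
                    {"span_id": span_id})

checkout("ord-123")
```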
Teams love saying they do chaos engineering. Few actually have hypotheses. Even fewer fix what they find.
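For concreteness, a sketch of what a hypothesis-first chaos experiment can look like: it refuses to run without a measurable steady-state check and always rolls back. All names and the toy replica counter are illustrative, not from the post.

```python
# A chaos experiment that states its hypothesis and steady-state check up
# front, and reports loudly when the hypothesis is falsified.
from dataclasses import dataclass
from typing import Callable

@dataclass
class ChaosExperiment:
    hypothesis: str                    # e.g. "p99 stays under 300ms if a replica dies"
    steady_state: Callable[[], bool]   # measurable check, before and after
    inject_fault: Callable[[], None]   # the actual disruption
    rollback: Callable[[], None]       # always runs, even on failure

    def run(self) -> bool:
        assert self.steady_state(), "system not healthy; refusing to start"
        try:
            self.inject_fault()
            holds = self.steady_state()
        finally:
            self.rollback()
        print(f"{'PASS' if holds else 'FALSIFIED'}: {self.hypothesis}")
        return holds

# Toy usage: the "fault" decrements a replica count, the check notices.
state = {"replicas": 3}
exp = ChaosExperiment(
    hypothesis="service stays available with one replica down",
    steady_state=lambda: state["replicas"] >= 2,
    inject_fault=lambda: state.update(replicas=state["replicas"] - 1),
    rollback=lambda: state.update(replicas=3),
)
exp.run()
```

The point is the structure, not the toy fault: if you can't write the `steady_state` function, you don't have a hypothesis yet.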
Most SLOs are dashboards nobody acts on. Here's how to pick indicators that reflect real users, set targets grounded in data, and make error budgets actually change how your team ships.
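The arithmetic that makes an error budget actionable fits in a few lines. The 99.9% target, window, and traffic numbers below are illustrative, not from the post.

```python
# Back-of-envelope error-budget math: a target only changes how you ship
# if you can say how much failure it allows and how fast you're spending it.
WINDOW_DAYS = 28
SLO_TARGET = 0.999            # 99.9% of requests succeed
REQUESTS_PER_DAY = 2_000_000

budget_fraction = 1 - SLO_TARGET
budget_requests = budget_fraction * REQUESTS_PER_DAY * WINDOW_DAYS
budget_minutes = budget_fraction * WINDOW_DAYS * 24 * 60  # if fully down

print(f"Error budget over {WINDOW_DAYS} days: "
      f"{budget_requests:,.0f} failed requests, "
      f"or ~{budget_minutes:.0f} minutes of full downtime.")

# Burn rate: a rate of 1.0 exhausts the budget exactly at the end of the
# window, so paging alerts typically fire well above 1.0.
observed_error_rate = 0.004   # 0.4% of requests failing right now
burn_rate = observed_error_rate / budget_fraction
print(f"Current burn rate: {burn_rate:.1f}x "
      f"(budget gone in {WINDOW_DAYS / burn_rate:.1f} days at this pace)")
```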
The SRE hype train has everyone copying Google's playbook without asking whether it fits. Here's what actually matters when you're not running planet-scale infrastructure.
What I learned building incident management at a fintech startup: from five people shouting across a room to actual structured response.