What a 3 AM Outage Taught Me About Incident Management


Good incident response is not about preventing failure. It is about failing well. Lessons from a decade of on-call, including NATO and telecom-scale operations.

It was 3:14 AM on a Tuesday when my phone went off. A payment processing service at a company I was working with had started returning 500s. By the time I opened my laptop, three different people were in a Slack channel typing theories. Nobody knew who was in charge. Nobody had checked the obvious things. Someone was already looking at database query plans for a problem that turned out to be a bad deploy from four hours earlier.

Forty-seven minutes of customer impact. The actual fix was a rollback that took ninety seconds.

That incident wasn’t a technology failure. It was a process failure. The system broke the way systems break. The response was where the real damage happened.

Detect User Pain, Not Internal Noise

I learned this working on NATO communication systems and reinforced it at every enterprise since. Your alerts should fire on symptoms that affect users: error rates, latency, failed transactions. Not CPU spikes. Not disk usage on a non-critical box. Not “pod restarted.”

Every page should pass a simple test: does the on-call person know what to do when they see this alert? If the answer is “look at the dashboard and figure it out,” the alert is bad. Delete it or rewrite it.
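To make this concrete, here is what a symptom-based alert might look like as a Prometheus alerting rule. This is a sketch, not a rule from any real system: the service name, threshold, and runbook URL are illustrative assumptions.

```yaml
groups:
  - name: payments-user-pain
    rules:
      - alert: PaymentErrorRateHigh
        # Fires on a user-facing symptom (failed requests),
        # not CPU spikes or pod restarts.
        expr: |
          sum(rate(http_requests_total{service="payments", code=~"5.."}[5m]))
            / sum(rate(http_requests_total{service="payments"}[5m])) > 0.02
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Payments 5xx rate above 2% for 5 minutes"
          # The runbook link is what makes the alert actionable at 3 AM.
          runbook: "https://wiki.example.com/runbooks/payments-5xx"
```

Notice that the rule passes the test above: the summary says what is wrong in user terms, and the runbook link tells the on-call person what to do next.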

At one Verizon engagement, we cut alert volume by 60% in two weeks by applying this filter. The remaining alerts had higher signal. Response times dropped because engineers trusted the pages they received.

Detection sources worth having:

  • Monitoring alerts on user-facing symptoms
  • Synthetic checks that simulate real user flows
  • Error tracking with spike detection
  • Customer support escalations (slower but often the most precise signal)
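A synthetic check from the list above can be as small as a function that fetches the page a real user would hit. This is a minimal sketch, assuming a plain HTTP endpoint; a real check would exercise a full flow like login or checkout.

```python
import urllib.request


def synthetic_check(url: str, timeout: float = 5.0) -> bool:
    """Simulate a minimal user flow: fetch what a real user would fetch.

    Returns True only if the endpoint answers 200 within the timeout.
    Any exception (DNS failure, refused connection, timeout) counts as
    a failed check, because that is what the user would experience.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False
```

Run it on a schedule and alert when consecutive checks fail, so a single network blip doesn't page anyone.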

Triage: Three Questions, Thirty Seconds

When the alert fires, you need answers fast: how bad is it, how wide is it, and is it getting worse?

I use four severity levels. The exact labels don’t matter. What matters is that everyone agrees on them before the incident happens.

  • SEV1: Widespread outage, data loss, or security breach. All hands.
  • SEV2: Major degradation. Limited workarounds. Page the on-call team.
  • SEV3: Partial impact with a clear workaround. Business hours response.
  • SEV4: Cosmetic or internal-only. Fix it in the next sprint.
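The three triage questions map directly onto these levels. Here is a sketch of that mapping in code; the thresholds are illustrative assumptions, and the point is to agree on yours before the incident, not to copy these numbers.

```python
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # widespread outage, data loss, or breach
    SEV2 = 2  # major degradation, limited workarounds
    SEV3 = 3  # partial impact with a clear workaround
    SEV4 = 4  # cosmetic or internal-only


def triage(error_rate: float, pct_users_affected: float, trending_worse: bool) -> Severity:
    """Answer 'how bad, how wide, is it getting worse' with a severity level.

    Thresholds here are placeholders; define your own in advance so nobody
    is debating severity while the incident is live.
    """
    if pct_users_affected >= 50 or error_rate >= 0.5:
        return Severity.SEV1
    if pct_users_affected >= 10 or (error_rate >= 0.05 and trending_worse):
        return Severity.SEV2
    if error_rate >= 0.01:
        return Severity.SEV3
    return Severity.SEV4
```

Encoding the decision removes the 3 AM judgment call: the on-call engineer plugs in two numbers and a trend and gets a level everyone already agreed on.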

If you’re debating severity during the incident, you haven’t defined your levels well enough.

Assign Roles Immediately

The 3 AM incident I described? The biggest problem wasn’t the broken service. It was that nobody knew who was running the response. Three engineers investigating three different theories. No communication to stakeholders. No timeline being recorded.

Every incident needs at minimum:

  • Incident commander. Owns the response. Makes decisions. Doesn’t debug.
  • Technical lead. Drives investigation and proposes fixes.
  • Communications lead. Sends updates. Talks to stakeholders so the IC doesn’t have to.

For SEV1s, add a scribe to record the timeline. You’ll need it for the postmortem, and you won’t remember what happened at 3:47 AM.

Mitigate First, Investigate Later

This is the hardest discipline to build. Engineers want to understand the root cause before they fix anything. I get it. The problem is that while you’re reading query plans, customers are losing money.

Mitigate first. Roll back the deploy. Disable the feature flag. Fail over to the secondary. Shed non-critical traffic. Get the system stable, then investigate.

The investigation can start in parallel, but the priority is always reducing user impact. Always.
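One reason feature flags are the fastest mitigation on that list: disabling a flag changes behavior without a deploy. A minimal sketch of the idea, with illustrative names (real systems would back this with a shared store rather than an in-process dict):

```python
# Flags are read at request time, so flipping one takes effect immediately,
# with no build, no deploy, no restart.
FLAGS = {"new_checkout_flow": True}


def is_enabled(flag: str) -> bool:
    """Check a flag at the call site; unknown flags default to off."""
    return FLAGS.get(flag, False)


def disable(flag: str) -> None:
    """The kill switch: one call mitigates while investigation continues."""
    FLAGS[flag] = False
```

The design choice that matters is the safe default: an unknown or missing flag reads as off, so a misconfigured lookup fails toward the old, stable path.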

The Runbook Template That Actually Gets Used

I’ve written a lot of runbooks. Most of them were too long and nobody read them during an actual incident. The ones that work fit on one screen and answer four questions.

Here is the template I use now:

# [Service Name] - [Problem Type]

## Symptoms
- What the alert looks like
- Key metrics to check (with dashboard links)

## Quick Verification
1. Check [specific endpoint/metric] to confirm the problem
2. Check [deployment history] for recent changes
3. Check [dependency status] for upstream issues

## Mitigation Steps
1. If recent deploy: rollback via [command/link]
2. If dependency failure: enable [fallback/circuit breaker]
3. If capacity issue: scale [component] to [target]

## Escalation
- Primary: [name/team] via [channel]
- Secondary: [name/team] via [channel]
- External vendor: [contact] for [specific dependency]

## Post-Mitigation
- Verify [user-facing metric] returns to baseline
- Send update to [channel]
- Open postmortem doc: [template link]

That’s it. Short enough to use at 3 AM with one eye open. Detailed enough to guide someone who has never seen this problem before.

Communication During the Incident

Under-communicating during an incident is almost always worse than over-communicating. Stakeholders fill silence with imagination, and their imagination is always worse than reality.

Send updates every 30 minutes during active incidents. Even if the update is “still investigating, no change.” Use a fixed format:

Status: Investigating / Mitigating / Monitoring
Impact: [Who is affected and how]
Current action: [What we are doing right now]
Next update: [Time]

No speculation. No blame. Facts and timeline.

Postmortems: The Part Everyone Skips

The postmortem isn’t the document. The postmortem is the action items that come out of the document.

I’ve read hundreds of postmortems that end with “improve monitoring” and “add more tests” and nothing ever happens. Good action items are specific, owned, and time-bound. “Add latency alerting to the checkout service with a 500ms p99 threshold, owned by Sarah, due by December 15.”

Track completion rate. If your action item completion rate is below 70%, your postmortem process is theater.
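Tracking that rate takes very little machinery. A sketch of the minimum structure an action item needs to be specific, owned, and time-bound (field names are illustrative):

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    """A postmortem action item: what, who, and by when."""
    description: str
    owner: str
    due: date
    done: bool = False


def completion_rate(items: list[ActionItem]) -> float:
    """Fraction of action items completed; below 0.7, the process is theater."""
    if not items:
        return 0.0
    return sum(item.done for item in items) / len(items)
```

If an item can't be expressed in this shape, with a concrete owner and a due date, it isn't an action item yet.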

What a good postmortem includes:

  • Impact summary (duration, affected users, revenue impact if known)
  • Timeline with decision points
  • Root cause and contributing factors
  • What went well in the response
  • What slowed the response
  • Action items with specific owners and deadlines

The Habits That Compound

Incident management is a muscle. It gets stronger with use and atrophies without it.

  • Weekly incident review. 15 minutes. What happened, what are we learning, are action items shipping.
  • Game days. Quarterly at minimum. Simulate real failures. Find the gaps in your runbooks before the gaps find you.
  • On-call shadowing. New team members shadow for at least two rotations before going primary. This is non-negotiable.
  • Runbook updates. After every real incident, update the runbook for that service. If no runbook exists, write one.
  • Rollback testing. If you’ve never tested your rollback procedure, you don’t have a rollback procedure.

The best incident response I’ve ever seen wasn’t from a team with amazing tools. It was from a team that practiced. Every week. Every quarter. Every time something broke, they got a little better.

Incidents are inevitable. Bad response isn’t.