3 AM on a Tuesday. Our primary data pipeline is down. Financial news is stale. Institutional clients are waking up in London expecting fresh data and getting yesterday’s numbers. I’m on a call with two engineers, both talking over each other, both working on different theories, neither aware of what the other has already tried. A third person pushes a fix to production that conflicts with what engineer number two is doing. The pipeline stays down for another forty minutes.
That incident cost us a client conversation I never want to have again.
At the fintech startup, we were five people for a long time. Incident response was simple — someone noticed a problem, yelled about it, and whoever was closest fixed it. Worked great. Then we grew. More services, more people, more time zones. That informal model didn’t scale. It just produced chaos with more participants.
Roles Kill Ambiguity
The single biggest improvement we made was assigning roles during incidents. Not permanently — per incident.
Incident commander. One person owns coordination. They decide severity, pull in the right people, and keep things moving. They don’t debug. They run the show.
Technical lead. The person actually investigating and fixing the problem. They talk to the incident commander, not to Slack at large.
Comms lead. Someone handles customer updates and internal stakeholders. Separating this from the technical work was huge for us. Engineers hate writing status updates mid-crisis, and it shows.
Scribe. Captures the timeline. What happened when, what was tried, what worked. Without this, your postmortem is fiction.
For smaller incidents, one person wears multiple hats. Fine. But for anything SEV-1, staff every role. Context switching during a crisis is where mistakes compound.
Severity Levels Are a Decision Framework
We use four levels. SEV-1 is a full outage or data loss — all hands, immediate external comms. SEV-2 is partial degradation — the on-call team owns it, with clear escalation paths. SEV-3 affects a subset of users — on-call handles it. SEV-4 is minor and goes through the normal ticket workflow.
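A shared definition works best when it lives somewhere executable, not just in a wiki page. Here's a minimal sketch of what that agreement can look like in code — the descriptions, timings, and policy fields are illustrative, not our actual config:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SeverityPolicy:
    description: str
    page_immediately: bool   # wake someone up vs. wait for business hours
    external_comms: bool     # does this trigger a status page update?
    ack_minutes: int         # how fast someone must acknowledge

# Illustrative mapping -- tune the numbers to whatever your team agrees on.
SEVERITY = {
    "SEV-1": SeverityPolicy("Full outage or data loss", True, True, 10),
    "SEV-2": SeverityPolicy("Partial degradation", True, False, 10),
    "SEV-3": SeverityPolicy("Subset of users affected", False, False, 30),
    "SEV-4": SeverityPolicy("Minor issue, normal workflow", False, False, 24 * 60),
}

def should_page(sev: str) -> bool:
    """One unambiguous answer to 'do we wake someone up?'"""
    return SEVERITY[sev].page_immediately
```

The point of encoding it is that the 3 AM question gets answered by the table, not by whoever happens to be on the call.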
The point isn’t the numbering. It’s that everyone agrees on what warrants waking someone up at 3 AM versus filing a ticket. Without that agreement, you get two failure modes: either every alert is treated as a fire drill, or real emergencies get a slow response because the team is desensitized.
On-Call That Doesn’t Destroy People
I’ve seen on-call rotations burn out good engineers in months. We run week-long rotations with explicit handoffs and a secondary on-call for escalation. The handoff is a real conversation, not a calendar entry.
Three rules we follow:
Response time expectations are written down. Not “ASAP” — actual numbers. Acknowledge within ten minutes, active investigation within thirty.
Noisy alerts get fixed or deleted. Every alert that wakes someone up without being actionable erodes trust in the system. We review alert volume monthly.
Compensation exists. Time off after a rough night. No one should eat the cost of a 3 AM page out of team loyalty.
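The monthly alert review lends itself to a small script. A hedged sketch, assuming you can export page events as (alert name, was-it-actionable) records — this is an assumed export format, not a real PagerDuty or Opsgenie API:

```python
from collections import Counter

def noisy_alerts(pages, min_pages=5, max_actionable_ratio=0.5):
    """Flag alerts that fire often but are rarely actionable.

    `pages` is an iterable of (alert_name, was_actionable) tuples.
    Anything that fired at least `min_pages` times with an actionable
    ratio at or below `max_actionable_ratio` is a fix-or-delete candidate.
    """
    fired = Counter()
    actionable = Counter()
    for name, was_actionable in pages:
        fired[name] += 1
        if was_actionable:
            actionable[name] += 1
    return [
        name for name, count in fired.items()
        if count >= min_pages and actionable[name] / count <= max_actionable_ratio
    ]
```

The thresholds are judgment calls; the value is having the list generated for you every month instead of relying on someone remembering which alerts annoyed them.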
The First Five Minutes
When a page fires, the response should be boring. Acknowledge it. Assess severity. Open a dedicated channel. Page the people who need to be there. Start a timeline with what you know, even if what you know is almost nothing.
This sounds obvious. It isn’t. Without a checklist, people skip steps when adrenaline kicks in. They jump straight to debugging and forget to tell anyone what’s happening.
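The checklist gets followed more reliably when one command performs it. Here's a sketch of the shape — the channel naming and timeline format are illustrative, and the chat/paging integrations are left as a stub since they depend entirely on your tooling:

```python
import datetime

def open_incident(title: str, severity: str, reporter: str) -> dict:
    """Run the first-five-minutes checklist in one step.

    Hypothetical helper: real channel creation and paging calls
    would replace the stub comment below.
    """
    now = datetime.datetime.now(datetime.timezone.utc)
    slug = title.lower().replace(" ", "-")
    channel = f"#inc-{now:%Y%m%d}-{slug}"
    # Start the timeline immediately, even if all we know is the symptom.
    timeline = [f"{now:%H:%M} UTC -- opened by {reporter}: {title} ({severity})"]
    # Stub: create the chat channel, page on-call, post the initial entry here.
    return {"channel": channel, "severity": severity, "timeline": timeline}
```

Adrenaline skips steps; a function doesn't.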
Mitigation Before Root Cause
This one took me too long to learn. When something is broken, your first job is to stop the bleeding. Roll back the deploy. Restart the service. Disable the feature. Redirect traffic. Whatever makes the user impact stop.
Root cause analysis is important. It’s also a luxury you earn after users stop being affected. I’ve watched engineers spend thirty minutes debugging while customers churned. Fix it first, understand it later.
Postmortems That Actually Change Things
We do blameless postmortems for every significant incident. Timeline, impact, root cause, contributing factors, action items. The blameless part matters — if people think they’ll get punished, they’ll hide what actually happened, and your postmortem becomes useless.
But here’s what I care about more than the postmortem document: action items that ship. Every action item gets an owner and a deadline. We review them in our regular engineering meetings. An item is closed when the change is deployed, not when someone says they’ll get to it.
I can’t overstate this. If your postmortem action items rot in a backlog, you will have the same incident again. Guaranteed.
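Closing that loop is easy to check mechanically. A sketch, assuming action items are tracked as records with an owner, a deadline, and a shipped flag — an assumed format, not any particular tracker's schema:

```python
import datetime

def overdue_items(items, today=None):
    """Return action items past their deadline that haven't shipped.

    `items` is a list of dicts with 'owner', 'deadline' (datetime.date),
    and 'shipped' keys. An item only counts as done when shipped is True --
    mirroring the rule that closed means deployed, not promised.
    """
    today = today or datetime.date.today()
    return [
        item for item in items
        if not item["shipped"] and item["deadline"] < today
    ]
```

Run something like this before the regular engineering meeting and the review agenda writes itself.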
Patterns Over Individual Fires
Track your incidents. Frequency by service, time to detect, time to resolve, on-call load per team. Individual incidents teach you something. Patterns teach you where to invest. If one service generates 60% of your pages, that’s not an on-call problem — it’s an engineering priority problem.
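The aggregation behind that kind of claim is simple. A sketch over an assumed incident log — one service name per page, which is not any specific tool's export format:

```python
from collections import Counter

def page_share_by_service(pages):
    """Compute each service's share of total pages, largest first.

    `pages` is an iterable of service names, one entry per page.
    Returns a dict mapping service -> fraction of all pages.
    """
    counts = Counter(pages)
    total = sum(counts.values())
    return {svc: n / total for svc, n in counts.most_common()}
```

If the top entry is carrying most of the load, that's the prioritization conversation, with numbers attached.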
Pick Tools Your Team Will Use
I don’t care which paging system you run. PagerDuty, Opsgenie, whatever. I care that your team actually uses it consistently. Same for your status page, your incident channels, your postmortem docs. The best incident management tool is the one nobody has to be reminded to check.
We keep it simple at the fintech startup. Dedicated Slack channels per incident, a status page for external comms, and a shared doc template for postmortems. Nothing fancy. All of it used every time.
The Real Lesson
Incident management isn’t about preventing all failures. Systems break. What separates good teams from drowning teams is predictability in the response. Everyone knows their role, everyone knows the severity, communication flows to the right places, and the things you learn actually make it into production. That’s it. No magic.