Your Incident Response Plan Is Useless Until Someone Bleeds

| 7 min read |
security incident-response devops startups

Most incident response plans are shelf-ware. Here's what actually matters when your infrastructure is on fire – drawn from real breaches, NATO cyber exercises, and startup chaos.

Quick take

Nobody reads the 40-page incident response PDF until the breach is already happening. Build muscle memory instead. Run drills. Assign roles before the adrenaline hits. The plan that survives first contact is the one your team has rehearsed, not the one sitting in Confluence.


I’ve sat in rooms where people were genuinely unsure whether they were being breached or just seeing noisy logs. That moment of paralysis – where nobody knows who decides, who calls whom, who touches what – is where the real damage happens. Not the exploit itself. The confusion.

This post is about building the kind of incident response capability that actually functions when things go sideways. Not a framework you can admire from a distance. A process you can execute at 2am with half your team asleep.

The problem with most IR plans

Every company I’ve worked with has some version of an incident response document. At the fintech startup, we had one. At Dropbyke, we had one. They were fine documents. Well-structured. Thorough. And completely useless the first time something real happened.

The issue isn’t the content. The issue is that nobody has practiced it. An IR plan is like a fire drill – the value isn’t in the map of the exits, it’s in the fact that everyone has walked the route.

During a NATO cyber defense exercise I participated in, the teams that performed best weren’t the ones with the most sophisticated tooling. They were the ones where every person knew their role cold. The analyst knew to preserve logs before touching anything. The incident commander knew to start a timeline before asking questions. The comms lead had templates ready. Muscle memory.

Severity: stop overthinking it

I’ve seen teams spend 20 minutes debating whether something is a SEV-2 or SEV-3 while the attacker is still active. Here is a simple rule that works for startups and mid-size teams:

SEV-1: You know or strongly suspect data has been accessed by someone who shouldn’t have it. Customer data, credentials, financial records. Page everyone.

SEV-2: Something is clearly wrong – unauthorized access, exploitation of a known vuln, weird lateral movement – but you don’t know the scope yet. Page the security lead and an incident commander.

SEV-3: Suspicious activity that needs investigation but shows no evidence of compromise. Next business hours is fine.

That’s it. Three levels. The distinction between 1 and 2 is “do we believe data was accessed?” If yes, SEV-1. If not yet clear, SEV-2. Stop debating and start working.
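The rule is small enough to write down as a function, which is also a cheap way to pin the definition in your runbook repo. A sketch – the function and argument names are mine, not from any tool:

```python
def classify_severity(data_accessed: bool, clear_compromise: bool) -> str:
    """Map the two triage questions to a severity level.

    data_accessed: do we know or strongly suspect data was accessed
    by someone who shouldn't have it?
    clear_compromise: is something clearly wrong (unauthorized access,
    known-vuln exploitation, lateral movement), scope still unknown?
    """
    if data_accessed:
        return "SEV-1"  # page everyone
    if clear_compromise:
        return "SEV-2"  # page security lead + incident commander
    return "SEV-3"      # investigate next business hours
```

If your paging tool can take a webhook, wiring this in means the severity debate happens once, in code review, instead of at 2am.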

The first 30 minutes decide everything

I have a strong opinion on this: the first 30 minutes of an incident determine whether you spend three days cleaning up or three weeks. Here is what has to happen in that window:

Minute 0-5: Acknowledge and assign. Someone claims the incident commander role. Not “can someone look at this?” – an explicit “I’m IC for this.” Open a dedicated channel. Name it something obvious like incident-2019-07-15.

Minute 5-15: Triage and preserve. Before you touch anything, snapshot what you can. Disk images, memory dumps, log exports. At Decloud, we learned this the hard way when someone rebooted a compromised container to “fix it” and wiped the volatile evidence we needed. Never again.

The key questions at this stage:

  • Is the attacker still active right now?
  • What systems are potentially in scope?
  • What is the worst-case data exposure?

Minute 15-30: Contain. Isolate affected hosts. Revoke compromised credentials. Block known-bad IPs. Kill active sessions from suspicious accounts. The goal isn’t to understand everything yet – it’s to stop the bleeding.

A useful mental model: containment is a tourniquet. You apply it fast and ugly. Eradication is surgery. That comes later.

Roles that actually matter

Skip the org chart with 12 named roles. For a team under 50 engineers, you need four:

Incident Commander (IC): Owns the incident. Makes decisions when there’s disagreement. Controls the communication cadence. The IC doesn’t debug – they coordinate. The single hardest skill here is resisting the urge to jump into a terminal. I struggle with this every time.

Technical Lead: Drives the investigation. Validates containment. Recommends eradication steps. This person lives in the logs and the systems.

Scribe: Documents everything with timestamps. Decisions, actions, findings, who said what. This role feels bureaucratic until you need to write the post-mortem or talk to a regulator. Then it’s the most valuable person in the room.

Comms Lead: Handles internal updates and external messaging. Gets legal involved early if disclosure might be required. At a startup, this is often the CEO or co-founder. Fine. Just make sure they aren’t also trying to be the IC.

Investigation: follow the trail, not your assumptions

The biggest mistake in incident investigation is anchoring on the first hypothesis. “It was probably a leaked API key” becomes the only thing anyone investigates, while the actual entry point – a vulnerable dependency three services deep – goes unexamined.

Ask these questions systematically:

  • How did initial access occur?
  • What was the dwell time? (How long were they in before we noticed?)
  • Was there lateral movement?
  • What data was accessed, modified, or exfiltrated?

Where to look:

  • Application logs (request patterns, auth failures, unusual API calls)
  • Cloud audit logs (IAM changes, new resources, API activity)
  • Network flow data (unexpected outbound connections, data transfer volumes)
  • Auth logs (session creation, privilege escalation, MFA bypasses)

Build your timeline as you go. Not after. Reconstructing a timeline from memory 48 hours later is fiction writing, not forensics.
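“As you go” happens by default if the scribe appends to something append-only instead of keeping it in their head. A minimal sketch – class and field names are my own, not from any incident tooling:

```python
from datetime import datetime, timezone

class Timeline:
    """Append-only incident timeline. Record events when they happen,
    not from memory 48 hours later."""

    def __init__(self):
        self.entries = []

    def add(self, event: str, source: str = "unknown") -> None:
        # Timestamp at record time, in UTC, so entries sort and compare cleanly.
        self.entries.append({
            "at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
            "event": event,
            "source": source,
        })

    def render(self) -> str:
        # One line per entry, ready to paste into the post-mortem.
        return "\n".join(f"{e['at']}  [{e['source']}]  {e['event']}"
                         for e in self.entries)
```

A shared text file or channel thread works just as well – the point is that every entry gets a timestamp the moment it is written.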

Evidence: the chain-of-custody problem

This is the part most startup engineers have never thought about. If your incident turns into a legal matter – and breaches involving customer data often do – you need evidence that holds up.

Practical rules:

  • Snapshot before you fix. Always.
  • Hash everything you collect. SHA-256 is fine.
  • Record who collected what, when, and where it’s stored.
  • Don’t modify originals. Work on copies.

A minimal evidence log looks like this:

evidence:
  item: api-server-memory.raw
  collected_by: law
  collected_at: 2019-07-15T10:45:00Z
  hash: sha256:a8f5f167f44f4964e6c998dee827110cd41d8cd98f00b204e9800998ecf8427e
  storage: s3://incident-evidence/2019-07-15/
  notes: captured before container restart

Is this overkill for most startups? Maybe. But the one time you need it and don’t have it, you will wish you had spent the extra five minutes.
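Those five minutes are mostly mechanical, which means they can be scripted. A sketch that hashes a collected file and emits an entry in the shape of the log above – the function name and dict layout are illustrative:

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path

def record_evidence(path: str, collected_by: str, notes: str = "") -> dict:
    """Hash a collected evidence file and return a log entry.
    Run this on the copy you collected, before anyone touches it."""
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    return {
        "item": Path(path).name,
        "collected_by": collected_by,
        "collected_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "hash": f"sha256:{digest}",
        "notes": notes,
    }
```

For multi-gigabyte disk or memory images, hash in chunks (`hashlib.file_digest` on Python 3.11+, or a plain read loop) instead of loading the whole file into memory.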

Communication: silence isn’t a strategy

Internal communication during an incident should follow a predictable cadence. SEV-1 gets updates every 30 minutes. SEV-2 every hour. Use a simple template:

Incident: checkout-auth-bypass
Severity: SEV-1
Status: containment
IC: law
Last update: 2019-07-15 11:00 UTC
Summary: Unauthorized API access via expired OAuth tokens.
         Revoked all active sessions. Investigating scope.
Next update: 11:30 UTC

External communication is harder and scarier. My advice: get legal involved before you send anything to customers. Not because lawyers make things better, but because disclosure obligations vary by jurisdiction and getting it wrong creates more problems than the breach itself. GDPR gives you 72 hours for supervisory authority notification. Some US state laws are shorter. Know your obligations before the incident, not during.

The post-mortem: where the real value lives

Run a blameless post-mortem within 48 hours. Not a week later. Not “when things calm down.” Memory degrades fast.

Structure it simply:

  1. Timeline: What happened, in order, with timestamps.
  2. Root cause: Not “human error.” The systemic reason. Why was it possible for a human to make that error?
  3. What worked: Reinforce the things that went right.
  4. What broke: Be honest about gaps.
  5. Action items: With owners and deadlines. No orphaned tasks.

The post-mortem document isn’t a punishment. It’s the mechanism by which you convert an expensive incident into durable organizational learning. If your team is afraid to write honest post-mortems, that’s a bigger problem than any security incident.

Preparation: the boring stuff that saves you

None of this works if you haven’t done the groundwork:

  • Run tabletop exercises quarterly. Describe a scenario. Walk through the response. Find the gaps before they find you. At EF, during the Decloud days, we would run these as part of our regular sprint rituals. 30 minutes. Low overhead. High value.
  • Test your backups. Not “verify they exist.” Actually restore from them. I’ve seen teams discover their backup pipeline had been silently failing for weeks.
  • Ensure logging is actually on. Audit trails, cloud API logs, application logs. If it isn’t being collected, it doesn’t exist during an incident.
  • Keep a contact list current. Who do you call at your cloud provider? Your legal counsel? Your insurance carrier? Your biggest customer’s security team? Write it down.
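“Write it down” can be as simple as a YAML file in the runbook repo, reviewed whenever the on-call rotation changes. Every value below is a placeholder:

```yaml
# contacts.yaml – review quarterly; every entry here is a placeholder.
contacts:
  cloud_provider:
    org: your cloud support tier
    how: support case, then phone escalation
  legal_counsel:
    org: external counsel
    how: phone; after-hours number lives in the runbook
  cyber_insurance:
    org: carrier incident hotline
    how: policy number plus hotline
  key_customer_security:
    org: your biggest customer's security team
    how: shared channel or named contact
```

The format matters far less than the review date. A contact list that was accurate two reorgs ago is worse than none, because you'll trust it.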

The uncomfortable truth

Most startups won’t invest seriously in incident response until after their first real breach. That’s human nature. But the gap between “we have a plan” and “we’ve practiced the plan” is the gap between a manageable incident and a company-threatening one.

Build the muscle memory. Run the drills. Assign the roles. The plan doesn’t need to be perfect. It needs to be practiced.