Incident Management

Definition

Incident Management coverage in this archive centers on 3 posts from Oct 2017 to Nov 2025 and frames incident management as continuous risk reduction rather than one-time policy work. The closest adjacent topics are reliability, SRE, and on-call. Recurring title words include "incident", "incidents", and "AI".

Working claims

The strongest pattern is operational: security controls are effective only when they are embedded in delivery flow.
The consistent theme from 2017 to 2025 is disciplined execution over hype cycles.
This topic repeatedly intersects with reliability, SRE, and on-call, so design choices here rarely stand alone.

How to apply this

Map threats to concrete controls, then tie each control to an owner and an observable signal.
Start with the newest post to calibrate current constraints, then backtrack to older entries for first principles.
When boundary questions appear, cross-read the reliability and SRE posts before committing to implementation details.
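
As a rough illustration of the first step above, the threat-to-control mapping can be kept as a small register that is checked for gaps. This is a minimal sketch, not a method taken from the posts: the `Control` fields, the `find_gaps` helper, and every entry in `REGISTER` are hypothetical names and sample data invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Control:
    """One control mapped to a threat. All field names are illustrative."""
    threat: str
    control: str
    owner: Optional[str] = None   # team or person accountable for the control
    signal: Optional[str] = None  # alert, metric, or log query that shows it works


def find_gaps(controls: list[Control]) -> list[str]:
    """Return one message per control that lacks an owner or an observable signal."""
    gaps = []
    for c in controls:
        missing = [
            name
            for name, value in (("owner", c.owner), ("signal", c.signal))
            if not value
        ]
        if missing:
            gaps.append(f"{c.control}: missing {', '.join(missing)}")
    return gaps


# Hypothetical sample data, invented for this sketch.
REGISTER = [
    Control("credential theft", "MFA on admin accounts",
            owner="platform-team", signal="failed-MFA alert"),
    Control("unpatched dependency", "weekly dependency scan", owner="app-team"),
    Control("ransomware", "offline backups"),
]

for gap in find_gaps(REGISTER):
    print(gap)
```

Running a check like this in CI or before a game day is one way to surface controls that nobody owns or can observe; the right storage format and review cadence depend on the team.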

Where teams get burned

Treating compliance checklists as a substitute for runtime detection and response.
Adding controls no one owns, tests, or rehearses under incident pressure.
Applying guidance written between 2017 and 2025 without revisiting its assumptions as the context changed.

Suggested reading path

Start here (current state): AI Incidents Don’t Look Like Outages. That’s the Problem.
Then read (operating middle): What a 3 AM Outage Taught Me About Incident Management
Finish with (foundational context): Your Incident Process Will Break at 15 People. Here’s What to Do.

References

10 entries tagged “Incident Management”

De-Risking the Black Swan: Red-Teaming Distributed Databases Before Production
March 16, 2026 · 8 min
Red-teaming distributed databases before production: most catastrophic failures are compound scenarios nobody practiced, not black swans.
Tags: distributed-systems, databases, resilience

AI Incidents Don't Look Like Outages. That's the Problem.
November 10, 2025 · 4 min
AI systems can return 200 OK while confidently wrong. How to detect, contain, and learn from AI incidents using proven incident response principles.
Tags: incident-management, ai, reliability

What Log4j Actually Taught Us
January 10, 2022 · 5 min
Log4j wasn't a dependency problem. It was an operational readiness problem. Here's what to fix before the next one hits.
Tags: security, log4j, dependencies

Log4j Is on Fire. Here's What to Do Right Now.
December 13, 2021 · 5 min
CVE-2021-44228 is the worst vulnerability I have seen in a decade. If you run Java anywhere, stop reading the news and start inventorying.
Tags: security, log4j, vulnerability

What a 3 AM Outage Taught Me About Incident Management
November 29, 2021 · 6 min
Good incident response is not about preventing failure. It is about failing well. Lessons from a decade of on-call, including national cyber-defense and telecom-scale operations.
Tags: incident-management, sre, on-call

SolarWinds Got Owned. Your Build Pipeline Might Be Next.
December 14, 2020 · 5 min
The SolarWinds supply-chain compromise is the wake-up call every software team needed. What happened, why it matters, and what you should do right now.
Tags: security, supply-chain, solarwinds

Your Incident Response Plan Is Useless Until Someone Bleeds
July 15, 2019 · 7 min
Most incident response plans are shelf-ware. What actually matters when your infrastructure is on fire, drawn from real breaches and national cyber-defense exercises.
Tags: security, incident-management, devops

Your Incident Process Will Break at 15 People. Here's What to Do.
October 23, 2017 · 5 min
What I learned building incident management at the fintech startup, from five people shouting across a room to actual structured response.
Tags: incident-management, devops, on-call

WannaCry Hit. Here's What It Actually Exposed.
May 15, 2017 · 4 min
WannaCry wasn't sophisticated -- a known exploit with a patch already out. The real failure was organizational, and most companies are still making it.
Tags: security, ransomware, incident-management

Security Incident Response for Startups
May 23, 2016 · 9 min
A practical incident response playbook for small teams: define incidents, assign owners, contain fast, investigate calmly, and recover with clear communication.
Tags: security, incident-management, startups