// Topic
Incident Management
Definition
This archive covers incident management in 3 posts published between Oct 2017 and Nov 2025, and frames it as continuous risk reduction rather than one-time policy work. The most closely related topics are reliability, SRE, and on-call. Words that recur in the post titles include "incident" and "AI".
Working claims
- The strongest pattern is operational: security controls are effective only when they are embedded in the delivery flow.
- The consistent theme from 2017 to 2025 is disciplined execution over hype cycles.
- This topic repeatedly intersects with reliability, SRE, and on-call, so design choices here rarely stand alone.
How to apply this
- Map threats to concrete controls, then tie each control to an owner and an observable signal.
- Start with the newest post to calibrate current constraints, then backtrack to older entries for first principles.
- When boundary questions appear, cross-read the reliability and SRE topics before committing to implementation details.
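The first step in this list, mapping each threat to a control with an owner and an observable signal, can be sketched as a small registry. This is a minimal illustration in Python, not a method taken from the posts: the `Control` fields, the `find_gaps` helper, and every threat, team, and signal name below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Control:
    """One control tied to the threat it addresses (all fields are hypothetical)."""
    threat: str   # what can go wrong
    name: str     # the concrete control
    owner: str    # team or person accountable; empty means unowned
    signal: str   # metric or alert that shows the control works; empty means unobserved


def find_gaps(controls):
    """Return (control name, problem) pairs for controls missing an owner or a signal."""
    gaps = []
    for c in controls:
        if not c.owner.strip():
            gaps.append((c.name, "no owner"))
        if not c.signal.strip():
            gaps.append((c.name, "no observable signal"))
    return gaps


# Illustrative data only; these threats, teams, and signals are made up.
CONTROLS = [
    Control("bad deploy reaches production", "canary rollout", "platform-team", "canary_error_rate"),
    Control("alert goes unacknowledged", "escalation policy", "", "time_to_ack_seconds"),
    Control("model returns unsafe output", "output filter", "ml-team", ""),
]

if __name__ == "__main__":
    for name, problem in find_gaps(CONTROLS):
        print(f"{name}: {problem}")
```

Run against the sample data, this reports the escalation policy as unowned and the output filter as lacking a signal. Keeping the registry as plain data makes it easy to run the same check in CI, though where such a check belongs is a team decision.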
Where teams get burned
- Treating compliance checklists as a substitute for runtime detection and response.
- Adding controls no one owns, tests, or rehearses under incident pressure.
- Applying guidance from anywhere in the 2017 to 2025 span without checking whether its assumptions still hold in the current context.
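One way to catch controls that nobody rehearses is to record when each was last exercised and flag the stale ones. The sketch below assumes a 180-day threshold and a simple name-to-date mapping; both are hypothetical choices for illustration, as are the control names and the `stale_controls` helper.

```python
from datetime import date, timedelta

# Hypothetical threshold: a control counts as stale if not rehearsed within this window.
MAX_REHEARSAL_AGE = timedelta(days=180)


def stale_controls(last_rehearsed, today, max_age=MAX_REHEARSAL_AGE):
    """Return names of controls never rehearsed, or rehearsed longer ago than max_age.

    last_rehearsed maps control name -> date of last rehearsal, or None if never.
    Names are returned in alphabetical order.
    """
    stale = []
    for name, when in sorted(last_rehearsed.items()):
        if when is None or today - when > max_age:
            stale.append(name)
    return stale


# Illustrative data only.
LAST_REHEARSED = {
    "failover runbook": date(2025, 9, 1),
    "rollback procedure": date(2024, 11, 15),
    "pager escalation": None,
}

if __name__ == "__main__":
    # Prints the controls that need a rehearsal as of the given date.
    print(stale_controls(LAST_REHEARSED, today=date(2025, 11, 1)))
```

With the sample data and a reference date of 1 Nov 2025, the never-rehearsed pager escalation and the year-old rollback procedure are flagged, while the recently exercised failover runbook is not.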
Suggested reading path
- Start here (current state): AI Incidents Don’t Look Like Outages. That’s the Problem.
- Then read (operating middle): What a 3 AM Outage Taught Me About Incident Management
- Finish with (foundational context): Your Incident Process Will Break at 15 People. Here’s What to Do.
References
3 posts
- AI Incidents Don't Look Like Outages. That's the Problem.
  Your AI system can return 200 OK and still be wrong, unsafe, or confidently hallucinating. Here's how to detect, contain, and learn from AI incidents, drawing on the same IR principles that work for traditional systems.
- What a 3 AM Outage Taught Me About Incident Management
  Good incident response is not about preventing failure. It is about failing well. Lessons from a decade of on-call, including NATO and telecom-scale operations.
- Your Incident Process Will Break at 15 People. Here's What to Do.
  What I learned building incident management at a fintech startup: from five people shouting across a room to actual structured response.