AI Incidents Don't Look Like Outages. That's the Problem.

| 4 min read |
incident-management ai reliability production

Your AI system can return 200 OK and still be wrong, unsafe, or confidently hallucinating. Here's how to detect, contain, and learn from AI incidents – drawing from the same IR principles that work for traditional systems.

Quick take

AI incidents are behavior failures, not downtime. Your monitoring says everything is green while the system confidently gives wrong answers. Detect with sampled quality checks and user feedback. Contain with rollbacks and feature flags, not root-cause analysis. Turn every incident into new eval coverage. Speed and reversibility beat thoroughness.


I wrote about incident response in 2019, drawing from NATO cyber exercises and real startup breaches. The core lesson was simple: teams that perform best under pressure are the ones that have practiced the response, not the ones with the fanciest playbook sitting in Confluence.

That lesson applies directly to AI systems. But AI incidents have a nasty twist.

The system is up. The system is wrong.

Traditional incidents are usually obvious. The service is down. Latency spikes. Error rates climb. Dashboards go red. Someone gets paged.

AI incidents are subtle. The service returns 200 OK. Latency is normal. No errors in the logs. But the system is confidently telling a customer something wrong. Or it regressed after an untracked prompt change. Or the retrieval layer is surfacing stale docs, and the model is synthesizing them into plausible-sounding garbage.

I’ve seen this firsthand. A team ships a model update on Friday. Quality degrades on a specific input class. Nobody notices until Monday because all the operational metrics look fine. The only signal was a spike in user thumbs-down feedback that nobody was monitoring.

That’s the core problem. Your existing monitoring was built for availability. AI incidents are about correctness, and correctness is harder to observe.

What counts as an AI incident

Any material deviation from expected behavior that can affect users or business outcomes. In practice:

  • Wrong-but-plausible responses that users might trust and act on
  • Regressions after model, prompt, or retrieval changes
  • Retrieval failures that surface irrelevant or outdated context
  • Safety or policy violations – the model doing something it shouldn’t

These are ambiguous by nature. There’s no clean threshold. So detection has to rely on multiple signals, not a single metric.

Detection that actually works

Teams that catch things quickly combine several layers:

Sampled quality checks. Automatically evaluate a percentage of live traffic against your eval criteria. This catches systematic regressions before they pile up.
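Here's a minimal sketch of what sampling might look like. The sample rate, the `contains_refusal` criterion, and the queue shape are all hypothetical; the point is that evaluation is asynchronous and never blocks the user-facing path.

```python
import random

SAMPLE_RATE = 0.05  # hypothetical: evaluate ~5% of live traffic

def contains_refusal(response: str) -> bool:
    # Toy eval criterion: flag responses that look like bare refusals.
    # A real check would run your actual eval criteria here.
    return response.strip().lower().startswith("i can't")

def maybe_sample(response: str, eval_queue: list, rng=random.random) -> None:
    """Enqueue a fraction of live responses for offline quality review.
    The rng parameter is injectable so the sampling is testable."""
    if rng() < SAMPLE_RATE:
        eval_queue.append({
            "response": response,
            "flagged": contains_refusal(response),
        })

queue = []
maybe_sample("I can't help with that.", queue, rng=lambda: 0.0)  # force a sample
```

In production the queue would feed a worker that aggregates flag rates per input class, so a systematic regression shows up as a trend rather than a single bad output.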

Targeted evals for known risk areas. If your system handles financial data or medical information, run focused checks on those categories continuously.

User feedback with low friction. A thumbs-down button isn’t sophisticated. It’s incredibly effective if someone is actually looking at the data. At a startup I ran, we learned that a simple feedback signal, reviewed daily, caught issues faster than any automated check.

Drift indicators. Track model behavior distributions over time. Track retrieval relevance scores. When these shift, something changed – even if nobody deployed anything.
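A drift check can be crude and still useful. This sketch flags a shift when the recent mean of some tracked signal (here, hypothetical retrieval relevance scores) lands far outside the baseline distribution; a real detector would use a proper statistical test and rolling windows.

```python
import statistics

def drift_alert(baseline: list, recent: list, threshold: float = 2.0) -> bool:
    """Flag drift when the recent mean sits more than `threshold`
    baseline standard deviations from the baseline mean.
    A crude z-style check -- a sketch, not a production detector."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    return abs(statistics.mean(recent) - mu) > threshold * sigma

# Hypothetical retrieval relevance scores: baseline steady near 0.80,
# recent window collapsed after an unnoticed index change.
baseline_scores = [0.78, 0.81, 0.80, 0.79, 0.82, 0.80]
recent_scores = [0.45, 0.50, 0.48, 0.47]
print(drift_alert(baseline_scores, recent_scores))  # True: relevance shifted
```

The same shape works for output-length distributions, refusal rates, or any other behavior signal you track over time.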

No single signal is ground truth. The goal is to surface a pattern early enough to contain it.

Containment: fast and reversible

The instinct during any incident is to understand what happened. Resist that. Contain first, investigate later. This is the same principle from traditional IR – the tourniquet analogy I’ve used before.

For AI systems, the most reliable containment actions are:

  • Roll back to a previous model or prompt version. This requires having versioned those artifacts in the first place.
  • Feature-flag the risky path. Disable or rate-limit the AI feature. Route to a fallback.
  • Escalate to human review. For high-stakes outputs, insert a human checkpoint until the issue is understood.
  • Increase sampling. Crank up monitoring on the affected workflow while the issue is active.

All of these are operational actions, not analytical ones. You don’t need to understand the root cause to stop the bleeding.
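The first two containment actions can be sketched as a flag-gated router. Everything here is illustrative -- the flag names, version labels, and fallback message are assumptions -- but it shows why containment is operational: flipping `ai_enabled` or repinning `model_version` is a config change, not a deploy.

```python
from dataclasses import dataclass

@dataclass
class Flags:
    ai_enabled: bool = True
    model_version: str = "v3"  # hypothetical version label

FALLBACK = "This feature is temporarily unavailable."

def answer(query: str, flags: Flags, models: dict) -> str:
    """Route through feature flags: kill switch first, then the
    pinned (possibly rolled-back) model version."""
    if not flags.ai_enabled:
        return FALLBACK  # containment: disable the risky path entirely
    return models[flags.model_version](query)

# Stand-in model registry; real versions would be served artifacts.
models = {"v2": lambda q: f"v2:{q}", "v3": lambda q: f"v3:{q}"}

flags = Flags()
flags.model_version = "v2"  # containment: roll back without a deploy
print(answer("hello", flags, models))  # v2:hello
```

Note that the rollback only works because both versions are still registered and addressable -- the versioning has to exist before the incident does.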

Postmortems that close the loop

Once contained, run a focused postmortem. The questions are specific:

  • Which outputs were wrong or unsafe? Get concrete examples.
  • What signal could have caught this earlier?
  • What evaluation gap allowed it through?
  • What operational control would have reduced the blast radius?

The most important action item from any AI postmortem: add the failure cases to your eval suite. Every incident should produce new test coverage. If your eval suite isn’t growing after incidents, you aren’t learning.

Keep action items small and testable. “Improve quality” isn’t an action item. “Add 10 regression cases from this incident to the eval suite and enforce a rollout gate for prompt changes in this workflow” is an action item.
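One way to make "add the failure cases to your eval suite" concrete: store each incident's examples as input/check pairs and run them as a gate. The case content and checker below are hypothetical; the structure is what matters.

```python
# Hypothetical incident-derived regression cases: each postmortem
# appends concrete examples, and the suite gates future rollouts.
REGRESSION_CASES = [
    {
        "input": "What is our refund window?",
        "must_contain": "30 days",  # assumed correct fact for illustration
    },
]

def run_regressions(model_fn, cases) -> list:
    """Return the inputs whose outputs fail their incident-derived check."""
    failures = []
    for case in cases:
        output = model_fn(case["input"])
        if case["must_contain"] not in output:
            failures.append(case["input"])
    return failures

# Stand-ins for a fixed vs. regressed model:
good = lambda q: "Refunds are accepted within 30 days of purchase."
bad = lambda q: "Refunds are accepted within 14 days of purchase."
print(run_regressions(good, REGRESSION_CASES))  # []
```

String-containment checks are the bluntest possible grader; the point is that the suite grows monotonically with every incident, so the same failure can't ship twice unnoticed.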

Prevention is a posture, not a gate

The teams that handle AI incidents well treat them as routine. Not as emergencies that mean someone failed. Practical prevention:

  • Evaluate changes before they hit full traffic. Canary deploys work for AI too.
  • Track model, prompt, and retrieval changes in a single changelog. When something breaks, you need to know what changed.
  • Maintain a simple runbook with containment options and owners. Not a 40-page document. A one-pager with “who gets paged, what can we roll back, what is the fallback.”
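The single-changelog idea needs almost no machinery. A sketch, with made-up component names and version labels -- the value is that model, prompt, and retrieval changes land in one ordered record, so "what changed before this broke?" has a single answer.

```python
import datetime

def log_change(changelog: list, component: str, old: str, new: str,
               author: str) -> None:
    """Append one entry to a unified changelog covering model,
    prompt, and retrieval changes."""
    changelog.append({
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "component": component,  # e.g. "model" | "prompt" | "retrieval"
        "old": old,
        "new": new,
        "author": author,
    })

log = []
log_change(log, "prompt", "v12", "v13", "alice")      # hypothetical entries
log_change(log, "retrieval", "index-a", "index-b", "bob")
```

Whether this lives in a database, a git log, or a spreadsheet matters far less than it being the one place everyone writes to.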

The goal isn’t zero incidents. The goal is fast detection, fast containment, and a system that gets more predictable over time. Same as any production system.