Here’s a scenario I’ve seen three times this year.
An AI-powered feature is in production. Uptime: 99.9%. Latency: nominal. Error rate: near zero. Dashboards are green. Everyone is happy.
Except the answers are wrong 15% of the time, and nobody knows because nothing is measuring answer quality. The system is healthy. The outputs are not.
This is the fundamental gap in AI observability. Traditional monitoring tells you whether the service is running. It does not tell you whether the service is useful.
Why AI systems fail silently
A classic API returns structured data. If the response is malformed, you get a parse error. If the logic is wrong, a test catches it. The failure modes are usually loud and obvious.
AI systems fail quietly. The model returns a perfectly formatted response with a confident tone and completely wrong content. The HTTP status is 200. The latency is fine. The JSON is valid. And the user just got told that their refund was processed when it wasn’t.
At a fintech startup, we had a similar problem with our financial news summarization pipeline, long before the current AI wave. The summaries looked plausible but occasionally attributed quotes to the wrong CEO or mixed up fiscal quarters. The system was “working” by every operational metric. The outputs were unreliable. We caught it only because a user complained, not because monitoring flagged it.
The lesson stuck with me. You can’t monitor AI like you monitor a REST API. You need different signals.
The signals that actually matter
I use a simple framework with five categories. If you are not tracking all five, you have blind spots.
Traceability. For every response, you need to know: which model, which prompt version, which retrieved context, which tool calls. If you can’t reconstruct why the model said what it said, you can’t debug a bad answer. You’re just guessing. I store a trace object alongside every response that includes model ID, prompt hash, retrieval IDs, and tool call logs. When something goes wrong, the trace is the first thing I pull.
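The trace described above can be sketched as a small record stored next to each response. This is a minimal illustration, not a standard schema; the field names (`model_id`, `prompt_hash`, and so on) are assumptions:

```python
# A minimal sketch of a per-response trace record. Field names and the
# 12-char hash length are illustrative choices, not a standard.
import hashlib
import json
from dataclasses import dataclass, field, asdict

def hash_prompt(prompt: str) -> str:
    """Stable fingerprint so a bad answer can be tied to an exact prompt."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]

@dataclass
class ResponseTrace:
    model_id: str                                       # exact model version served
    prompt_hash: str                                    # fingerprint of the rendered prompt
    retrieval_ids: list = field(default_factory=list)   # IDs of retrieved context chunks
    tool_calls: list = field(default_factory=list)      # name + args of each tool call

    def to_json(self) -> str:
        """Serialize for logging alongside the response."""
        return json.dumps(asdict(self))

trace = ResponseTrace(
    model_id="model-2024-06",
    prompt_hash=hash_prompt("You are a support assistant..."),
    retrieval_ids=["doc-17", "doc-42"],
    tool_calls=[{"name": "lookup_order", "args": {"order_id": "A-991"}}],
)
```

The point is not the exact shape, it is that the record is written for every response, so reconstructing a bad answer is a lookup rather than an archaeology project.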
Quality signals. This is the hard one. You need some measure of whether the output was good. Heuristic checks catch obvious failures: empty responses, responses that are too long or too short, and responses that contain known-bad patterns. Sampled evaluation catches the subtle failures: a human or a second model scores a random slice of outputs against a rubric. Neither is perfect. Together they cover enough ground.
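The heuristic layer can be as simple as a function that flags obviously bad outputs before they are logged. A minimal sketch, with thresholds and known-bad patterns that are purely illustrative and should be tuned per workflow:

```python
# A minimal sketch of heuristic quality checks. The size thresholds and
# BAD_PATTERNS list are illustrative; tune them for your own workflows.
import re

BAD_PATTERNS = [
    re.compile(r"as an ai (language )?model", re.IGNORECASE),  # refusal boilerplate
    re.compile(r"\[insert .*?\]", re.IGNORECASE),              # unfilled template slot
]

def heuristic_flags(text: str, min_chars: int = 20, max_chars: int = 4000) -> list:
    """Return a list of flag names; an empty list means the output passed."""
    flags = []
    stripped = text.strip()
    if not stripped:
        flags.append("empty")
    elif len(stripped) < min_chars:
        flags.append("too_short")
    elif len(stripped) > max_chars:
        flags.append("too_long")
    for pattern in BAD_PATTERNS:
        if pattern.search(stripped):
            flags.append("known_bad_pattern")
            break
    return flags
```

Heuristics like these run on every response; the sampled rubric evaluation runs on a slice. The split matters: cheap checks get full coverage, expensive ones get statistical coverage.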
Cost per outcome. Not cost per request, cost per successful outcome. A system that gets it right on the first try costs less than one that needs three retries and a human escalation. Track the full cost of getting to a good answer, including retries, fallbacks, and human review. This number will surprise you.
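The arithmetic is simple but worth making concrete. A minimal sketch, where the event structure (a cost and a success flag per attempt) is an assumption:

```python
# A minimal sketch: cost per successful outcome, counting every attempt
# (retries, fallbacks, human review) against the successes they produced.
# The event dict shape is illustrative.

def cost_per_outcome(events: list) -> float:
    """events: dicts with 'cost' (USD) and 'success' (bool), one per attempt."""
    total_cost = sum(e["cost"] for e in events)
    successes = sum(1 for e in events if e["success"])
    if successes == 0:
        return float("inf")  # spent money, got nothing usable
    return total_cost / successes

events = [
    {"cost": 0.002, "success": False},  # first attempt, bad answer
    {"cost": 0.002, "success": False},  # retry, still bad
    {"cost": 0.150, "success": True},   # human escalation resolved it
]
```

In this illustrative example, cost per request averages to about $0.05, but cost per successful outcome is $0.154, three times higher. That gap is exactly what the per-request number hides.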
Safety and policy. Refusal rates, blocked content, policy trigger counts. If your refusal rate spikes, something changed: either the inputs or the model behavior. If it drops to zero, something might be wrong too. These are canary signals.
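A refusal-rate canary can be a few lines. This is a sketch under assumed thresholds (a 2% baseline, a 3x spike tolerance, a 500-request floor for the zero-rate check); real values depend on your traffic:

```python
# A minimal sketch of a refusal-rate canary. The baseline, tolerance,
# and traffic floor are illustrative assumptions, not recommendations.

def refusal_canary(refusals: int, total: int, baseline: float = 0.02,
                   tolerance: float = 3.0) -> str:
    """Return 'ok', 'spike', or 'silent' based on drift from baseline."""
    if total == 0:
        return "ok"  # no traffic, nothing to judge
    rate = refusals / total
    if rate > baseline * tolerance:
        return "spike"   # inputs or model behavior changed
    if baseline > 0 and rate == 0 and total >= 500:
        return "silent"  # a zero rate on real traffic is also suspicious
    return "ok"
```

Note that the check alerts in both directions, which is the point of the paragraph above: a refusal rate of zero on meaningful traffic is as much a signal as a spike.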
Operational basics. Latency percentiles by workflow (not globally; global averages hide everything), error rates with reason codes, token usage trends. The same stuff you track for any API, but broken down by the AI-specific dimensions that matter.
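Per-workflow percentiles can be computed directly from tagged latency samples. A minimal sketch using the standard library, with workflow names invented for illustration:

```python
# A minimal sketch: latency percentiles per workflow rather than globally.
# Workflow names and sample values are illustrative.
import statistics
from collections import defaultdict

def percentiles_by_workflow(samples: list) -> dict:
    """samples: (workflow, latency_ms) pairs -> {workflow: {'p50': ..., 'p95': ...}}."""
    by_wf = defaultdict(list)
    for workflow, latency in samples:
        by_wf[workflow].append(latency)
    out = {}
    for workflow, values in by_wf.items():
        # quantiles with n=20 yields 19 cut points; index 18 is the 95th percentile
        cuts = statistics.quantiles(values, n=20, method="inclusive")
        out[workflow] = {"p50": statistics.median(values), "p95": cuts[18]}
    return out
```

Grouping first, then computing, is what keeps a slow summarization workflow from being averaged away by a high-volume chat workflow.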
The prompt versioning problem
Here is something that bites almost every team. Someone changes a prompt. Quality drops. Nobody connects the two events because the prompt change was not tracked alongside the quality metrics.
Treat prompts as production code. Version them. Deploy them through your normal release process. Tag every response with the prompt version that produced it. When quality dips, the first question should be: what changed since the last known-good state?
I version prompts in the same repo as the service code. A prompt change gets a PR, a review, and a run against the eval suite before it hits production. It sounds like overkill until the first time it prevents a regression. Then it sounds obvious.
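The mechanics of tagging responses with a prompt version can be very light. A minimal sketch, assuming prompts live as text files in a `prompts/` directory in the repo; the layout and function names are illustrative:

```python
# A minimal sketch of prompts-as-code: prompt files versioned in the repo,
# each load returning a fingerprint to log with every response it produces.
# The prompts/ layout and the 8-char hash length are illustrative.
import hashlib
from pathlib import Path

PROMPT_DIR = Path("prompts")  # lives in the same repo as the service code

def load_prompt(name: str) -> tuple:
    """Return (template, version_tag); log the tag alongside each response."""
    text = (PROMPT_DIR / f"{name}.txt").read_text(encoding="utf-8")
    version = hashlib.sha256(text.encode("utf-8")).hexdigest()[:8]
    return text, version
```

Because the tag is derived from the file content, any edit to the prompt changes the tag, and a quality dip can be correlated against prompt versions without anyone remembering to bump a number manually.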
Keep it lean
The temptation is to build a dashboard for everything. Do not. Start with the minimum set of signals that lets you answer one question: “A user reported a bad answer. Can I explain why it happened and prevent it from happening again?”
If you can answer that question end-to-end, your observability is good enough. If you can’t, no amount of dashboards will save you.
Log the trace. Track quality. Version your prompts. Measure cost per outcome, not cost per request. That’s the baseline. Everything else is optimization.