Quick take
Treat LLM calls like you treat database calls in OpenTelemetry: trace them, measure them, and alert on quality drift – not just errors. Your Grafana dashboard showing 200 OK and low latency means nothing if the model is hallucinating.
Two weeks after we shipped the transaction categorization feature at a fintech company, OpenAI quietly changed something in their model inference. Our API calls still returned 200. Latency was normal. The logs looked clean. But accuracy dropped from 89% to 86% overnight, and the confidence score distribution shifted in a way that pushed more borderline results above our threshold.
We caught it because we had quality monitoring in place. If we’d been relying on traditional application monitoring – error rates, latency percentiles, throughput – we’d have missed it entirely. The system was “healthy” by every standard metric. It was just less correct.
This is the fundamental problem with LLM observability. The output can be fluent, well-formatted, and completely wrong. Your existing monitoring stack doesn’t know the difference.
What LLM observability actually means
For traditional services, observability is the RED metrics: Rate, Errors, Duration. For LLMs, you need those plus an entire quality dimension that doesn’t exist in normal software.
I think about it in three layers:
Infrastructure layer. The basics. API availability, latency (p50, p95, p99), error rates by type (rate limits, timeouts, 500s), token usage, cost. If you’re using OpenTelemetry – and you should be – this maps cleanly onto standard span attributes.
Model behavior layer. This is the new part. Output quality, confidence distributions, format compliance, fallback rates, and drift detection. These signals require domain-specific instrumentation, not just HTTP metrics.
Product outcome layer. User actions on model output: acceptance, edits, rejections, re-prompts. This is where you learn whether the model is actually helping or just producing plausible noise.
How we instrumented it
I’m a big OpenTelemetry advocate. We were already using it for everything, and extending it to LLM calls was a natural fit. The key insight: treat each LLM call as a span in your trace, with custom attributes for the things you need to measure.
Here’s the shape of what we log for every LLM call:
span.name: "llm.completion"
span.attributes:
  llm.model: "gpt-3.5-turbo-0613"
  llm.prompt_template: "tx-categorize-v3"
  llm.prompt_tokens: 847
  llm.completion_tokens: 124
  llm.total_tokens: 971
  llm.temperature: 0.0
  llm.confidence_score: 0.91
  llm.output_valid: true
  llm.fallback_used: false
  llm.cost_usd: 0.0015
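In code, this mostly amounts to flattening the provider response into an attribute dict and attaching it to the current span. A minimal sketch, assuming a response dict with `usage` and `confidence` fields; the token prices are illustrative assumptions, not official rates, and `build_llm_span_attributes` is a hypothetical helper name:

```python
# Illustrative per-1K-token prices -- an assumption, not official rates.
PRICE_PER_1K_TOKENS = {"prompt": 0.0015, "completion": 0.002}

def estimate_cost_usd(prompt_tokens: int, completion_tokens: int) -> float:
    """Rough call cost in USD from token counts."""
    return round(
        prompt_tokens / 1000 * PRICE_PER_1K_TOKENS["prompt"]
        + completion_tokens / 1000 * PRICE_PER_1K_TOKENS["completion"],
        6,
    )

def build_llm_span_attributes(response: dict, template: str,
                              fallback_used: bool) -> dict:
    """Flatten a provider response into the span attributes listed above."""
    usage = response["usage"]
    return {
        "llm.model": response["model"],
        "llm.prompt_template": template,
        "llm.prompt_tokens": usage["prompt_tokens"],
        "llm.completion_tokens": usage["completion_tokens"],
        "llm.total_tokens": usage["prompt_tokens"] + usage["completion_tokens"],
        "llm.temperature": response.get("temperature", 0.0),
        "llm.confidence_score": response["confidence"],
        "llm.output_valid": response["valid"],
        "llm.fallback_used": fallback_used,
        "llm.cost_usd": estimate_cost_usd(usage["prompt_tokens"],
                                          usage["completion_tokens"]),
    }
```

In an OpenTelemetry setup, the resulting dict goes to `span.set_attributes(...)` inside a `tracer.start_as_current_span("llm.completion")` block wrapping the API call.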
The prompt template version is critical. Without it, you can’t correlate quality changes with prompt changes. We version our prompts like we version our API: tx-categorize-v3 tells us exactly which instructions the model received.
We also log a hashed version of the input (for cardinality analysis) and the full output (for sampling and review). The full input gets logged to a separate, access-controlled store because transaction data is sensitive. Don’t dump PII into your standard telemetry pipeline. I’ve seen teams do this. Don’t be that team.
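The fingerprint itself is cheap. A sketch with Python’s `hashlib` (the 16-character truncation is an arbitrary choice):

```python
import hashlib

def input_fingerprint(raw_input: str) -> str:
    """Stable fingerprint of the input for cardinality analysis.
    The raw text never enters the general telemetry pipeline."""
    return hashlib.sha256(raw_input.encode("utf-8")).hexdigest()[:16]
```

One caveat: a bare hash of a low-entropy field (a name, a small amount) can be reversed by brute force, so for genuinely sensitive fields a keyed HMAC (Python’s `hmac` module with a secret key) is the safer variant.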
The quality metrics that actually matter
After running this for a couple of months, here’s what I actually look at:
Accuracy on the eval set. We run our 200-transaction eval set daily against the live model. This is the canary. If this number drops, something changed – either in our code, our prompts, or the model itself. We alert at a 2-point drop sustained for two consecutive runs.
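That alert condition fits in a few lines. A sketch, assuming accuracy in percentage points and a `baseline` you maintain separately (e.g. a trailing average); `eval_regression_alert` is a hypothetical helper name:

```python
def eval_regression_alert(daily_accuracy: list[float], baseline: float,
                          drop_pts: float = 2.0, runs: int = 2) -> bool:
    """Fire only when the last `runs` eval scores all sit at least
    `drop_pts` percentage points below the baseline."""
    if len(daily_accuracy) < runs:
        return False
    return all(a <= baseline - drop_pts for a in daily_accuracy[-runs:])
```

Requiring two consecutive runs below the line is what keeps a single noisy eval run from paging anyone.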
Confidence score distribution. Not the average – the distribution. A shift in the shape of the histogram tells you more than the mean. When the model gets less confident on average, it usually means either the input distribution has changed or the model itself has drifted.
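One way to quantify “the shape of the histogram shifted” is the Population Stability Index, a standard drift measure borrowed from credit scoring. A self-contained sketch, assuming confidence scores in [0, 1]; the 0.1/0.25 levels are conventional rules of thumb, not tuned values:

```python
import math

def psi(expected: list[float], observed: list[float],
        bins: int = 10, eps: float = 1e-4) -> float:
    """Population Stability Index between two samples of scores in [0, 1].
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 investigate."""
    def hist(xs: list[float]) -> list[float]:
        counts = [0] * bins
        for x in xs:
            counts[min(int(x * bins), bins - 1)] += 1  # clamp 1.0 into top bin
        return [max(c / len(xs), eps) for c in counts]  # eps avoids log(0)
    p, q = hist(expected), hist(observed)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))
```

Comparing the live window against a reference window flags shifts even when the mean barely moves, which is exactly the failure mode a single averaged gauge hides.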
Fallback rate. What percentage of requests hit our rules-based fallback instead of using the model output? This is a compound signal: it reflects both model quality and input quality. A spike in fallback rate is always worth investigating.
User correction rate. How often do users change the model’s output? This is the ground truth. Automated metrics are approximations. User corrections are direct feedback. We track this weekly and use it to update our eval set.
Cost per successful categorization. Not cost per API call. Cost per result that the user accepted without correction. This metric keeps us honest about whether quality improvements are actually cost-effective.
Dashboards for different audiences
I built three dashboards. Each one serves a different question.
The ops dashboard shows error rates, latency, rate limit hits, and cost. This is for on-call. It answers: “Is the LLM integration working?” Standard SRE stuff, just with LLM-specific dimensions.
The quality dashboard shows accuracy trends, confidence distributions, fallback rates, and eval set results. This is for the engineering team. It answers: “Is the model producing good results?” This is the one we check every morning.
The product dashboard shows user correction rates, acceptance rates, and task completion metrics. This is for the PM. It answers: “Is this feature helping users?” This is the one that justifies the feature’s existence.
The temptation is to build one mega-dashboard. Resist it. Different audiences need different views, and combining them creates noise that ensures nobody looks at any of it.
Alerting without drowning
The biggest mistake I see: alerting on every metric at static thresholds. LLM behavior is inherently variable. You’ll page yourself into exhaustion.
What works better: alert on sustained deviations from a rolling baseline. Our eval accuracy fluctuates between 87% and 91% day to day; that’s normal. A static threshold is either tight enough to fire on that normal fluctuation and get tuned out, or loose enough to miss a real two-point regression. An alert on “3+ consecutive days below the 7-day moving average minus 2 points” catches real regressions while ignoring the noise.
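That rule translates directly into code. A sketch, assuming one accuracy score per day in percentage points; window sizes and the 2-point margin mirror the rule above, and `sustained_regression` is a hypothetical helper name:

```python
def sustained_regression(scores: list[float], window: int = 7,
                         drop_pts: float = 2.0, days: int = 3) -> bool:
    """True if each of the last `days` scores falls below the trailing
    `window`-day moving average (excluding that day) minus `drop_pts`."""
    if len(scores) < window + days:
        return False  # not enough history for a stable baseline
    for i in range(len(scores) - days, len(scores)):
        baseline = sum(scores[i - window:i]) / window
        if scores[i] >= baseline - drop_pts:
            return False
    return True
```

Because the baseline trails the data, a slow seasonal drift moves the baseline with it; only a genuine step-change below the recent norm fires the alert.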
For cost, we alert on percentage increases above 50% between periods. Run over daily totals, this catches sudden spikes (runaway retry loops); run over weekly totals, it catches gradual creep (prompt bloat).
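Both checks are the same comparison over different aggregation windows. A sketch (the 50% threshold matches the rule above; feed it daily totals for spikes, weekly totals for creep):

```python
def cost_spike(period_costs: list[float], threshold: float = 0.5) -> bool:
    """True when the latest period's spend exceeds the previous
    period's by more than `threshold` (0.5 = 50%)."""
    if len(period_costs) < 2 or period_costs[-2] <= 0:
        return False  # nothing to compare against
    return (period_costs[-1] - period_costs[-2]) / period_costs[-2] > threshold
```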
The honest summary
LLM observability isn’t a new discipline. It’s regular observability plus quality measurement. If you’re already doing OpenTelemetry tracing, adding LLM spans is straightforward. If you’re already running eval sets, automating them on a daily cadence is a small step.
The hard part isn’t the tooling. The hard part is accepting that “the service is up” isn’t the same as “the service is working.” For LLMs, the gap between those two statements is where all the interesting failures live.
Monitor the quality. Version your prompts. Run your eval set daily. Everything else is optimization.