Quick take
If your evaluation process is “I tried a few prompts and it seemed fine,” you don’t have evaluation. You have hope. Build a small test set, automate checks, monitor production, and block deploys that regress. It isn’t hard. It’s just work nobody wants to do.
Last month I was on a call with a team. They had an AI-powered document analysis feature and wanted help figuring out why users were complaining about accuracy. My first question: “What does your evaluation suite look like?”
Silence. Then: “We test it manually before releases.”
That isn’t evaluation. That’s a prayer.
The core problem
LLMs are convincing even when they’re wrong. A hallucinated answer looks exactly like a correct one to someone who doesn’t already know the answer. This makes casual testing actively dangerous – it gives you false confidence.
The non-determinism makes it worse. Change one word in a system prompt and the behavior shifts in ways you can’t predict by reading the diff. The only way to know whether a change helped or hurt is to measure it against a stable reference.
What to actually measure
Not everything matters equally. I’ve seen teams build elaborate dashboards with dozens of metrics that nobody looks at. Start with the signals that map directly to user value.
| Signal | What it tells you | When it matters |
|---|---|---|
| Task success rate | Does the feature accomplish what users need? | Always |
| Format compliance | Can downstream systems parse the output? | Structured output, pipelines |
| Factual accuracy | Is the output correct? | Knowledge-heavy features |
| Safety compliance | Does the output follow policy? | User-facing, sensitive domains |
| Latency (p50/p95) | Is the feature fast enough? | Interactive features |
| Cost per task | Is this economically viable? | High-volume features |
Keep the list short. Four to six metrics is plenty. If you can’t explain why a metric is on the list, remove it.
Build a test set that looks like reality
This is where most teams cut corners, and it shows. A test set of five happy-path examples tells you nothing useful. You need cases that reflect the actual distribution of inputs your feature sees in production.
What a decent test set includes:
- Typical cases. The bread-and-butter inputs that make up 80% of traffic.
- Edge cases. Long inputs, short inputs, ambiguous inputs, inputs in unexpected formats.
- Known failure modes. Cases that broke in the past. These are gold.
- Adversarial inputs. Prompt injection attempts, confusing instructions, contradictory context.
Tag every case with a category. This prevents your overall score from hiding category-level failures. I’ve seen a system score 90% overall while completely failing on one important category because the other categories were easy.
Start with 30-50 cases. That’s enough to catch major regressions. Grow it as you learn.
The evaluation methods compared
There’s no single evaluation technique that works for everything. The right approach depends on what you’re measuring.
| Method | Speed | Consistency | Best for | Limitations |
|---|---|---|---|---|
| Exact match | Instant | Perfect | Structured output, classifications | Useless for open-ended tasks |
| Rule-based checks | Instant | Perfect | Format validation, required fields | Can’t judge quality or nuance |
| Model-as-judge | Fast | Good (but noisy) | Open-ended quality, tone, relevance | Needs calibration, can drift |
| Human review | Slow | Variable | Subjective quality, edge cases | Expensive, doesn’t scale |
| A/B testing (production) | Slow | Good (with volume) | Real-world impact | Requires traffic, slow feedback |
My recommendation: layer them. Use exact match and rule-based checks for everything you can. Use model-as-judge for quality on open-ended outputs, but calibrate it monthly against human reviewers. Reserve human review for cases where the automated signals disagree or when you’re exploring a new failure mode.
Offline vs. online: different jobs
This distinction matters more than most people realize.
Offline evaluation runs during development. It answers: “Did this prompt change improve behavior on known cases?” Run it before every deploy. Run it when you change prompts, retrieval logic, or model versions. It’s your regression gate.
Online evaluation runs in production. It answers: “Does this actually work for real users with real inputs?” Monitor task success, collect user signals (did they accept, edit, or reject the output?), and track drift over time.
| Aspect | Offline | Online |
|---|---|---|
| Purpose | Catch regressions | Validate real-world quality |
| Data source | Curated test set | Production traffic |
| Timing | Pre-deploy | Continuous |
| Feedback speed | Minutes | Hours to days |
| Blind spots | Can’t predict novel inputs | Hard to attribute cause |
You need both. A clean offline score without production monitoring gives you a false sense of security. I’ve seen features pass every offline test and fail in production because the test set didn’t represent the actual input distribution.
Operationalize it or it dies
Evaluation that lives in a notebook and runs when someone remembers isn’t evaluation. It’s a side project. Make it part of the delivery process.
The loop I use:
- Maintain a baseline. Your current production version’s scores on the test set. This is the bar.
- Run evals on every change. Prompt edits, model swaps, retrieval changes – all of it gets measured.
- Block deploys that regress. Not on every metric – pick the ones that matter and set thresholds.
- Refresh the test set. Add cases from production failures. Remove cases that no longer match product goals. Monthly is a good cadence.
- Review model-as-judge calibration. Monthly, have a human review a sample of the judge’s ratings. Adjust the grading prompt if it drifted.
The tooling to do this isn’t exotic: a script that runs your test set through the system, compares outputs to expected behavior, and produces a report. I’ve built these in a few hundred lines of Go. The hard part isn’t the code. It’s the discipline to actually run it every time.
The gap is discipline, not tooling
I keep coming back to this. The tools exist. The techniques are well-understood. The test sets aren’t that hard to build. What’s missing is the organizational willingness to treat AI output quality with the same rigor as test coverage or uptime.
If you wouldn’t ship a backend service without tests, you shouldn’t ship an AI feature without evaluation. Same principle. Same discipline. Different domain.
Build the test set. Automate the checks. Block the regressions. Everything else is details.