Quick take
If your evaluation process is “I tried a few prompts and it seemed fine,” you don’t have evaluation. You have hope. Build a small test set, automate checks, monitor production, and block deploys that regress. It isn’t hard. It’s just work nobody wants to do.
Last month I was on a call with a team. They had an AI-powered document analysis feature and wanted help figuring out why users were complaining about accuracy. My first question: “What does your evaluation suite look like?”
Silence. Then: “We test it manually before releases.”
That isn’t evaluation. That’s a prayer.
The core problem
LLMs are convincing even when they’re wrong. A hallucinated answer looks exactly like a correct one to someone who doesn’t already know the answer. This makes casual testing actively dangerous – it gives you false confidence.
The non-determinism makes it worse. Change one word in a system prompt and the behavior shifts in ways you can’t predict by reading the diff. The only way to know whether a change helped or hurt is to measure it against a stable reference.
What to actually measure
Not everything matters equally. I’ve seen teams build elaborate dashboards with dozens of metrics that nobody looks at. Start with the signals that map directly to user value.
| Signal | What it tells you | When it matters |
|---|---|---|
| Task success rate | Does the feature accomplish what users need? | Always |
| Format compliance | Can downstream systems parse the output? | Structured output, pipelines |
| Factual accuracy | Is the output correct? | Knowledge-heavy features |
| Safety compliance | Does the output follow policy? | User-facing, sensitive domains |
| Latency (p50/p95) | Is the feature fast enough? | Interactive features |
| Cost per task | Is this economically viable? | High-volume features |
Keep the list short. Four to six metrics is plenty. If you can’t explain why a metric is on the list, remove it.
Build a test set that looks like reality
This is where most teams cut corners, and it shows. A test set of five happy-path examples tells you nothing useful. You need cases that reflect the actual distribution of inputs your feature sees in production.
What a decent test set includes:
- Typical cases. The bread-and-butter inputs that make up 80% of traffic.
- Edge cases. Long inputs, short inputs, ambiguous inputs, inputs in unexpected formats.
- Known failure modes. Cases that broke in the past. These are gold.
- Adversarial inputs. Prompt injection attempts, confusing instructions, contradictory context.
Tag every case with a category. This prevents your overall score from hiding category-level failures. I’ve seen a system score 90% overall while completely failing on one important category because the other categories were easy.
Start with 30-50 cases. That’s enough to catch major regressions. Grow it as you learn.
The evaluation methods compared
There’s no single evaluation technique that works for everything. The right approach depends on what you’re measuring.
| Method | Speed | Consistency | Best for | Limitations |
|---|---|---|---|---|
| Exact match | Instant | Perfect | Structured output, classifications | Useless for open-ended tasks |
| Rule-based checks | Instant | Perfect | Format validation, required fields | Can’t judge quality or nuance |
| Model-as-judge | Fast | Good (but noisy) | Open-ended quality, tone, relevance | Needs calibration, can drift |
| Human review | Slow | Variable | Subjective quality, edge cases | Expensive, doesn’t scale |
| A/B testing (production) | Slow | Good (with volume) | Real-world impact | Requires traffic, slow feedback |
My recommendation: layer them. Use exact match and rule-based checks for everything you can. Use model-as-judge for quality on open-ended outputs, but calibrate it monthly against human reviewers. Reserve human review for cases where the automated signals disagree or when you’re exploring a new failure mode.
Offline vs. online: different jobs
This distinction matters more than most people realize.
Offline evaluation runs during development. It answers: “Did this prompt change improve behavior on known cases?” Run it before every deploy. Run it when you change prompts, retrieval logic, or model versions. It’s your regression gate.
Online evaluation runs in production. It answers: “Does this actually work for real users with real inputs?” Monitor task success, collect user signals (did they accept, edit, or reject the output?), and track drift over time.
| Aspect | Offline | Online |
|---|---|---|
| Purpose | Catch regressions | Validate real-world quality |
| Data source | Curated test set | Production traffic |
| Timing | Pre-deploy | Continuous |
| Feedback speed | Minutes | Hours to days |
| Blind spots | Can’t predict novel inputs | Hard to attribute cause |
You need both. A clean offline score without production monitoring gives you a false sense of security. I’ve seen features pass every offline test and fail in production because the test set didn’t represent the actual input distribution.
Operationalize it or it dies
Evaluation that lives in a notebook and runs when someone remembers isn’t evaluation. It’s a side project. Make it part of the delivery process.
The loop I use:
- Maintain a baseline. Your current production version’s scores on the test set. This is the bar.
- Run evals on every change. Prompt edits, model swaps, retrieval changes – all of it gets measured.
- Block deploys that regress. Not on every metric – pick the ones that matter and set thresholds.
- Refresh the test set. Add cases from production failures. Remove cases that no longer match product goals. Monthly is a good cadence.
- Review model-as-judge calibration. Monthly, have a human review a sample of the judge’s ratings. Adjust the grading prompt if it drifted.
The tooling to do this isn’t exotic: a script that runs your test set through the system, compares outputs to expected behavior, and produces a report. I’ve built these in a few hundred lines of Go. The hard part isn’t the code. It’s the discipline to actually run it every time.
The gap is discipline, not tooling
I keep coming back to this. The tools exist. The techniques are well-understood. The test sets aren’t that hard to build. What’s missing is the organizational willingness to treat AI output quality with the same rigor as test coverage or uptime.
If you wouldn’t ship a backend service without tests, you shouldn’t ship an AI feature without evaluation. Same principle. Same discipline. Different domain.
Build the test set. Automate the checks. Block the regressions. Everything else is details.