Testing AI Where It Actually Runs


Offline evals are necessary but not sufficient. Here's how I test AI features in production with shadow mode, canaries, and rollback automation -- with Go code.

Quick take

Your eval suite passes. Your staging environment looks good. Your AI feature will still break in production because real users do things your test set never imagined. Shadow it, canary it, measure it, and make every rollout reversible. Evidence before confidence.


I wrote about testing in production back in 2019. The core thesis hasn’t changed: staging lies to you. What has changed is that AI makes the lying worse.

Traditional software either works or it doesn’t. The test passes or fails. The API returns the right data or throws an error. AI features exist in a gray zone where the output is almost always plausible, sometimes correct, and occasionally dangerous. Your test suite can’t cover this space. Production can.

Why offline evals aren’t enough

Every AI project should have an eval suite. I’ve been saying this for over a year. But evals test known scenarios. Production surfaces the unknown ones.

Real users send inputs your test set never imagined. They misspell things. They paste in multi-language text. They include personally identifiable information that triggers different model behavior. They ask questions that are ambiguous in ways your eval prompts aren’t.

At one company, the AI support agent passed every eval with flying colors. In production, users started treating it like a search engine, pasting in order numbers and expecting it to look up status. The model happily hallucinated order details instead of saying “I can’t do that.” The eval suite had no test case for “user treats the chatbot like a database query tool.” Production found it in the first hour.

Shadow mode first

Before any AI change touches a real user, shadow it. Run the new version in parallel with the current one, compare outputs, and log everything. The user only sees the current version.

Here’s the pattern I use in Go:

type ShadowRunner struct {
	current   ModelClient
	candidate ModelClient
	logger    *ShadowLogger
}

func (s *ShadowRunner) Execute(ctx context.Context, req Request) (Response, error) {
	// Current model serves the user
	resp, err := s.current.Complete(ctx, req)

	// Candidate runs in background -- never blocks the user
	go func() {
		candidateCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
		defer cancel()

		candidateResp, candidateErr := s.candidate.Complete(candidateCtx, req)
		s.logger.LogComparison(ShadowResult{
			RequestID:       req.ID,
			CurrentOutput:   resp,
			CandidateOutput: candidateResp,
			CandidateErr:    candidateErr,
			Match:           s.compareOutputs(resp, candidateResp),
		})
	}()

	return resp, err
}

The shadow logger captures every comparison. I review divergences daily during the shadow period. If the candidate produces different outputs, I want to understand whether those differences are improvements, regressions, or neutral changes.

The shadow period should last at least a week -- longer for low-traffic services, which need more time to accumulate input diversity. The goal is to see enough real-world variety to have confidence in the change.
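What the daily divergence review looks like can be sketched as a small summary pass over the shadow logs. This is a sketch, not the production code: ShadowResult here is a string-output simplification of the struct logged by the ShadowRunner above, and the 98% match bar is an illustrative default, not a recommendation.

```go
package main

// Summarize a day's shadow results into a match rate plus the request
// IDs that diverged, so they can be triaged by hand.

type ShadowResult struct {
	RequestID       string
	CurrentOutput   string
	CandidateOutput string
	Match           bool
}

type DivergenceReport struct {
	Total        int
	MatchRate    float64
	DivergentIDs []string // request IDs to review by hand
}

func Summarize(results []ShadowResult) DivergenceReport {
	rep := DivergenceReport{Total: len(results)}
	matches := 0
	for _, r := range results {
		if r.Match {
			matches++
		} else {
			rep.DivergentIDs = append(rep.DivergentIDs, r.RequestID)
		}
	}
	if rep.Total > 0 {
		rep.MatchRate = float64(matches) / float64(rep.Total)
	}
	return rep
}

// ReadyToCanary reports whether the shadow period looks clean enough to
// move on. The 0.98 threshold is illustrative; every divergence should
// still be classified as an improvement, regression, or neutral change.
func ReadyToCanary(rep DivergenceReport) bool {
	return rep.Total > 0 && rep.MatchRate >= 0.98
}
```

The report doesn’t decide anything on its own; it just turns a pile of logs into a short triage list.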

Canary with kill switches

Once shadow results look good, move to a canary deployment. Route a small percentage of real traffic to the new version and monitor closely.

type CanaryRouter struct {
	current     ModelClient
	candidate   ModelClient
	percentage  atomic.Int32
	qualityGate *QualityGate
}

func (c *CanaryRouter) Route(ctx context.Context, req Request) (Response, error) {
	if c.shouldCanary(req.UserID) {
		resp, err := c.candidate.Complete(ctx, req)
		if err != nil || !c.qualityGate.Check(resp) {
			// Automatic fallback to current
			return c.current.Complete(ctx, req)
		}
		return resp, err
	}
	return c.current.Complete(ctx, req)
}

func (c *CanaryRouter) shouldCanary(userID string) bool {
	hash := fnv.New32a()
	hash.Write([]byte(userID))
	return int(hash.Sum32()%100) < int(c.percentage.Load())
}

The QualityGate is the part most teams skip. It checks the candidate response against basic quality criteria before serving it. If the response fails the gate, the user gets the current version transparently. No harm done.
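A minimal gate might look like the sketch below. The concrete rules -- length bounds and banned substrings -- are illustrative stand-ins for whatever invariants your product defines, and Check takes a plain string here rather than the Response type used by the router above.

```go
package main

import "strings"

// A minimal QualityGate sketch: cheap, deterministic checks that run
// before a candidate response is served to a user.

type QualityGate struct {
	MinLen int      // reject suspiciously short (often truncated) responses
	MaxLen int      // reject runaway generations
	Banned []string // substrings that must never be served verbatim
}

func (g *QualityGate) Check(resp string) bool {
	if len(resp) < g.MinLen || len(resp) > g.MaxLen {
		return false
	}
	lower := strings.ToLower(resp)
	for _, b := range g.Banned {
		if strings.Contains(lower, strings.ToLower(b)) {
			return false
		}
	}
	return true
}
```

Gates like this are deliberately cheap and deterministic: they sit on the serving path, so anything that needs a model call to evaluate belongs in offline review instead.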

I start at 1%. Watch for a day. If quality signals hold, move to 5%. Then 25%. Then 100%. Each step gets at least a few hours of observation. If anything looks off at any step, roll back to the previous percentage. No drama.

The hash-based routing is important: the same user always gets the same version within a rollout step. This prevents confusing experiences where the same user gets different quality outputs on consecutive requests.
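The ramp itself is worth encoding so nobody fat-fingers a jump from 1% to 100%. A sketch, using the step ladder above; in practice the returned value would be stored into the router’s atomic percentage field.

```go
package main

// The staged ramp: Advance moves to the next percentage only after the
// quality signals hold, Rollback drops to the previous step. The steps
// are the defaults described above; adjust them to your traffic.

var rampSteps = []int32{1, 5, 25, 100}

type Ramp struct {
	idx int // index into rampSteps; -1 means 0% of traffic
}

// NewRamp starts at 0% -- no canary traffic until the first Advance.
func NewRamp() *Ramp {
	return &Ramp{idx: -1}
}

func (r *Ramp) Current() int32 {
	if r.idx < 0 {
		return 0
	}
	return rampSteps[r.idx]
}

func (r *Ramp) Advance() int32 {
	if r.idx < len(rampSteps)-1 {
		r.idx++
	}
	return r.Current()
}

func (r *Ramp) Rollback() int32 {
	if r.idx >= 0 {
		r.idx--
	}
	return r.Current()
}
```

Keeping the ladder in code means a rollback is always a single step to a known-good percentage, never an improvised number.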

What to measure during rollout

Three categories of signals, checked at every rollout step:

Quality signals. Task success rate on your eval set. But also: user re-prompts (did they have to ask again?), abandonment rate (did they give up?), explicit negative feedback. These are the signals your eval suite can’t give you.

Safety signals. Refusal rate. Policy trigger count. Anything flagged by your content filters. If the candidate model refuses more or fewer requests than the current one, investigate before expanding.

Operational signals. Latency p50 and p95 by workflow. Token usage. Cost per request. Error rates. A model change that improves quality but doubles cost might not be a net win. Make that trade-off explicit.

type RolloutMetrics struct {
	Version          string
	QualityScore     float64
	RefusalRate      float64
	P50Latency       time.Duration
	P95Latency       time.Duration
	CostPerRequest   float64
	ErrorRate        float64
	UserRepromptRate float64
}

func (m *RolloutMetrics) PassesGate(baseline RolloutMetrics) bool {
	if m.QualityScore < baseline.QualityScore*0.95 {
		return false // quality regression > 5%
	}
	if m.ErrorRate > baseline.ErrorRate*1.5 {
		return false // error rate increase > 50%
	}
	if m.P95Latency > baseline.P95Latency*2 {
		return false // latency doubled
	}
	return true
}

These thresholds aren’t magic numbers. They’re product decisions. A 5% quality regression might be acceptable if cost drops by 40%. A latency doubling might be fine for a background task but fatal for a chat interface. Define them before the rollout starts, not during.

The one-change rule

Never change the model and the prompt at the same time. If quality drops, you won’t know which change caused it. This sounds obvious. I’ve watched four different teams make this mistake in the last three months.

Ship the prompt change. Measure. Ship the model change. Measure. If you must change both, do the prompt first because it’s cheaper to roll back.

Same goes for retrieval changes, system message changes, and tool configuration changes. One variable at a time. Anything else is debugging in the dark.

Holdout baselines

Keep a small, stable slice of traffic permanently on a known-good version. This is your holdout. It tells you whether quality changes are due to your changes or due to shifts in user behavior, input distribution, or upstream data.

Without a holdout, slow regressions look like normal variance. You won’t notice a 2% quality drop per week because no individual week looks bad. But your holdout will show the cumulative drift loud and clear.
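Holdout routing can reuse the same FNV bucketing as the canary router: pin the bottom few buckets to the known-good version before any canary logic runs. The 2% slice in this sketch is an illustrative choice.

```go
package main

import "hash/fnv"

// Users hashing into the bottom holdoutPercent buckets are pinned to the
// known-good version and never see canary traffic.

const holdoutPercent = 2

func bucket(userID string) int {
	h := fnv.New32a()
	h.Write([]byte(userID))
	return int(h.Sum32() % 100)
}

// InHoldout should be checked before any canary routing logic runs.
func InHoldout(userID string) bool {
	return bucket(userID) < holdoutPercent
}
```

Because the bucketing is deterministic, the holdout population stays stable across rollouts -- which is exactly what makes week-over-week comparisons against it meaningful.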

What matters

Testing AI in production isn’t reckless. Shipping AI without testing it in production is reckless. Offline evals give you a baseline. Shadow mode gives you confidence. Canaries give you safety. Holdouts give you ground truth.

Every rollout should be reversible, measurable, and attributable to a single change. That isn’t a testing philosophy. That’s engineering discipline applied to a system that fails in ways your test suite can’t anticipate.