How I Actually Test LLM Features


LLM outputs are non-deterministic. That doesn't mean you can't test them rigorously. Here's the layered testing approach I use in production.

Quick take

Test LLM features in layers: deterministic checks for everything around the model (parsing, validation, prompt rendering), property-based checks for model outputs (format, required fields, safety), and a curated golden set for regression detection. Don’t test exact string matches. Test the properties that matter to users.


The first time I shipped an LLM feature without a proper test suite, we spent three weeks arguing about whether the quality had regressed after a prompt change. Nobody had baseline numbers. Nobody had a definition of “good.” We were debugging by vibes.

Never again.

LLM testing is different from traditional software testing, but it isn’t impossible. It requires accepting that you’re testing probabilistic behavior and building your strategy around that reality instead of fighting it.

The problem with LLM outputs

Three things make LLM testing hard:

  1. Non-determinism. The same input can produce different outputs across runs, even with temperature set to zero (some providers still have variance).
  2. Multiple valid answers. For most tasks, there isn’t one correct answer. There’s a space of acceptable answers.
  3. Invisible regressions. A prompt change or model update can shift behavior without any code change. Your CI pipeline sees green. Your users see worse outputs.

The instinct is to throw up your hands and say “we can’t test this.” That’s wrong. You can test this. You just can’t use assertEqual.

Layer 1: deterministic tests for everything around the model

The code around the LLM – prompt rendering, response parsing, validation, error handling – is deterministic. Test it like normal software.

func TestPromptRendering(t *testing.T) {
    tmpl := NewSupportPrompt()
    result, err := tmpl.Render(PromptInput{
        CustomerName: "Alice",
        Issue:        "billing dispute",
        History:      []string{"previous contact on 2024-07-15"},
    })
    if err != nil {
        t.Fatalf("render failed: %v", err)
    }

    if !strings.Contains(result, "Alice") {
        t.Error("prompt should contain customer name")
    }
    if !strings.Contains(result, "billing dispute") {
        t.Error("prompt should contain issue description")
    }
    if !strings.Contains(result, "2024-07-15") {
        t.Error("prompt should contain interaction history")
    }
}

func TestResponseParsing(t *testing.T) {
    raw := `{"action": "escalate", "reason": "billing dispute over $500", "priority": "high"}`

    resp, err := ParseSupportResponse([]byte(raw))
    if err != nil {
        t.Fatalf("parse failed: %v", err)
    }

    if resp.Action != "escalate" {
        t.Errorf("expected action=escalate, got %s", resp.Action)
    }
    if resp.Priority != "high" {
        t.Errorf("expected priority=high, got %s", resp.Priority)
    }
}

These tests are fast, stable, and catch a surprising number of regressions. I’ve seen parsing bugs slip through because teams only tested the happy path; then the model started returning JSON with trailing commas.
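That failure mode is worth a deterministic test of its own. Here's a sketch of a small sanitizer that runs in front of the JSON parser, stripping markdown code fences and trailing commas. The `sanitizeModelJSON` name is hypothetical, and the trailing-comma regex is deliberately naive (it would mangle a comma-brace sequence inside a JSON string), so treat this as a starting point rather than a general JSON repair tool:

```go
package main

import (
    "regexp"
    "strings"
)

// trailingComma matches a comma immediately before a closing brace or
// bracket, which some models emit despite strict JSON forbidding it.
var trailingComma = regexp.MustCompile(`,\s*([}\]])`)

// sanitizeModelJSON strips markdown code fences and trailing commas so
// an almost-JSON model response can be handed to encoding/json.
func sanitizeModelJSON(raw string) string {
    s := strings.TrimSpace(raw)
    s = strings.TrimPrefix(s, "```json")
    s = strings.TrimPrefix(s, "```")
    s = strings.TrimSuffix(s, "```")
    s = trailingComma.ReplaceAllString(s, "$1")
    return strings.TrimSpace(s)
}
```

Feed the sanitized string to your parser, and keep a deterministic test that asserts fenced and trailing-comma variants survive the round trip.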

Also test mocked LLM responses to verify error handling and orchestration logic:

func TestHandlesModelTimeout(t *testing.T) {
    client := &MockLLMClient{
        Response: nil,
        Err:      context.DeadlineExceeded,
    }

    handler := NewSupportHandler(client)
    result, err := handler.Handle(context.Background(), "test query")

    if err != nil {
        t.Fatal("handler should not propagate model timeout as error")
    }
    if !result.Fallback {
        t.Error("should trigger fallback on timeout")
    }
}

Layer 2: property-based checks for model outputs

You can’t check that the model said “I apologize for the inconvenience.” You can check that the response acknowledges the issue, avoids profanity, and stays under 200 words.

Define a rubric. Keep it simple.

type EvalCriteria struct {
    Name  string
    Check func(input string, output string) bool
}

var supportResponseCriteria = []EvalCriteria{
    {
        Name: "acknowledges_issue",
        Check: func(input, output string) bool {
            lower := strings.ToLower(output)
            return strings.Contains(lower, "sorry") ||
                strings.Contains(lower, "understand") ||
                strings.Contains(lower, "apologize")
        },
    },
    {
        Name: "includes_next_steps",
        Check: func(input, output string) bool {
            lower := strings.ToLower(output)
            return strings.Contains(lower, "will") ||
                strings.Contains(lower, "next") ||
                strings.Contains(lower, "follow up")
        },
    },
    {
        Name: "reasonable_length",
        Check: func(input, output string) bool {
            words := strings.Fields(output)
            return len(words) >= 20 && len(words) <= 200
        },
    },
}

These aren’t perfect. The string matching is crude. But they catch common failure modes: responses that ignore the user’s problem, responses that are empty or absurdly long, and responses that miss expected elements.

For more nuanced checks – tone, factual accuracy, coherence – I use model-based evaluation. Have a separate evaluator model score the output against the rubric. It isn’t free, but it’s cheaper than human review on every test case and usually more reliable than regex.
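One way to plug a model-based evaluator into the same rubric is to adapt it to the `Check` signature used above. The `JudgeFunc` shape below is an assumption (how you actually call the evaluator model is provider-specific); the point of the sketch is that a judge error fails closed rather than silently passing:

```go
package main

import "strings"

// JudgeFunc asks an evaluator model to score an output against a
// rubric and returns its verdict. The signature is an assumption for
// illustration; the real call is provider-specific.
type JudgeFunc func(rubric, input, output string) (verdict string, err error)

// modelCheck adapts a judge into the same Check shape as the
// string-based criteria, treating any non-"pass" verdict or an error
// as a failure so a flaky judge fails closed.
func modelCheck(judge JudgeFunc, rubric string) func(input, output string) bool {
    return func(input, output string) bool {
        verdict, err := judge(rubric, input, output)
        if err != nil {
            return false
        }
        return strings.EqualFold(strings.TrimSpace(verdict), "pass")
    }
}
```

Because the adapter returns the same `func(input, output string) bool`, model-graded and regex-graded criteria can live in one rubric slice.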

Layer 3: the golden set

A golden set is a curated collection of representative inputs with expected properties. Not expected outputs, expected properties.

type GoldenCase struct {
    ID       string            `json:"id"`
    Input    string            `json:"input"`
    Expected map[string]string `json:"expected"`
}

// Example golden case
// {
//   "id": "billing_complaint_042",
//   "input": "I was charged twice for my subscription last month",
//   "expected": {
//     "tone": "empathetic",
//     "mentions": "refund OR credit OR billing",
//     "format": "paragraph under 150 words"
//   }
// }
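Wiring the golden set up is mostly plumbing. Here's a sketch that loads cases from JSON and checks the "mentions" property; the `checkMentions` helper and the `" OR "` splitting convention are assumptions for illustration, and properties like tone or format would need their own checkers, possibly model-based:

```go
package main

import (
    "encoding/json"
    "strings"
)

type GoldenCase struct {
    ID       string            `json:"id"`
    Input    string            `json:"input"`
    Expected map[string]string `json:"expected"`
}

// loadGoldenSet parses a JSON array of golden cases.
func loadGoldenSet(data []byte) ([]GoldenCase, error) {
    var cases []GoldenCase
    err := json.Unmarshal(data, &cases)
    return cases, err
}

// checkMentions verifies the "mentions" property: the output must
// contain at least one of the OR-separated terms, case-insensitively.
func checkMentions(gc GoldenCase, output string) bool {
    spec, ok := gc.Expected["mentions"]
    if !ok {
        return true // property not required for this case
    }
    lower := strings.ToLower(output)
    for _, term := range strings.Split(spec, " OR ") {
        if strings.Contains(lower, strings.ToLower(strings.TrimSpace(term))) {
            return true
        }
    }
    return false
}
```

Keeping the cases in a JSON file (rather than in test code) makes it easy for non-engineers to add and review them.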

I maintain 30-50 golden cases per feature. They cover common paths, known edge cases, and a few adversarial inputs. I run them weekly and after every prompt or model change.

The golden set is your regression detector. When a prompt change causes three previously passing golden cases to fail, you get a concrete signal that something shifted. No vibes. No arguments. Data.

The evaluation cadence that works

After trying several approaches, here’s what I’ve settled on:

  • Every commit: Run deterministic tests (layer 1). These are in CI and they block merges. Fast, stable, non-negotiable.
  • Every prompt/model change: Run the golden set (layer 3) and compare to the previous baseline. If pass rate drops, the change needs review.
  • Weekly: Run the full evaluation suite (layers 2 + 3) and track trends. Output a simple report: pass rate by criteria, any new failures, average response length.
  • After major updates: Human review of a random sample (~20 cases). Sanity check that the automated evaluation isn’t missing something.

This takes about two hours a week of human time. That’s a small investment for the confidence it provides.
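The "compare to the previous baseline" step can be as simple as a pass-rate check with a small tolerance. A sketch, where the function name and the tolerance value are illustrative rather than prescriptive:

```go
package main

// regressed flags a regression when the current pass rate drops more
// than tolerance percentage points below the stored baseline rate.
// An empty run is treated as "no signal" rather than a failure.
func regressed(passed, total int, baseline, tolerance float64) bool {
    if total == 0 {
        return false
    }
    rate := 100 * float64(passed) / float64(total)
    return rate < baseline-tolerance
}
```

The tolerance absorbs normal run-to-run variance from non-determinism, so the check fires on real shifts instead of noise; tune it against a few repeated runs of your own golden set.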

What I wish more teams did

Version your prompts. Every prompt change should be a tracked commit with a diff. When quality regresses, you need to know which prompt version caused it. I keep prompts in version-controlled files, not in application code.

Track quality over time. A single evaluation run is a snapshot. A time series of evaluation results shows trends. Is quality gradually degrading? Did a model provider update cause a step change? You can’t answer these without historical data.

Test adversarial inputs. Your golden set should include attempts to jailbreak, confuse, or extract system prompts. These aren’t hypothetical attacks. They’re things real users will try.

LLM testing isn’t about proving the model is correct. It’s about building enough evidence that the system behaves acceptably across the inputs that matter. Layers, properties, golden sets, and a consistent cadence. That’s the strategy.