Quick take
LLMs aren’t APIs. Same input, different output. Seconds of latency instead of milliseconds. Costs that scale with how much you talk. Treat them like the weird, expensive, unreliable-but-powerful dependency they are, and design accordingly.
I’ve been integrating LLMs into production systems since late 2022, and the one thing I keep telling teams is this: the model isn’t the hard part. The integration is.
An LLM call looks like an API call. It isn’t. It’s probabilistic, slow, expensive, and context-limited. If you design your system assuming deterministic behavior, you’ll have a bad time. If you design for variability from the start, everything gets easier.
Here are the patterns I keep reaching for.
The Constraints You Can’t Ignore
Before patterns, constraints. These shape every decision:
Non-determinism. Same prompt, different response. Small wording changes shift behavior. This is a feature of the technology, not a bug. But it means you need validation layers that traditional API integrations don’t.
Latency. We’re talking seconds, not milliseconds. Your UX needs to account for streaming, progress indicators, and the possibility that a request simply takes too long.
Cost. Tokens in, tokens out, money gone. Long prompts and verbose outputs are the biggest cost drivers. I’ve seen teams 10x their bill by not paying attention to prompt length.
Context limits. Everything – instructions, data, output – competes for the same window. Summarization and retrieval aren’t nice-to-haves. They’re architectural requirements.
Hallucinations. The model will confidently make things up. Without grounding or validation, you’ll ship lies to users.
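The cost constraint in particular is worth keeping in code rather than in your head. A minimal sketch of per-request cost math; the per-million-token prices here are placeholders, not any provider's real rates:

```go
package main

import "fmt"

// estimateCostUSD computes the dollar cost of one request from token counts
// and per-million-token prices. The prices are hypothetical placeholders --
// plug in your provider's actual rates.
func estimateCostUSD(inputTokens, outputTokens int, inPricePerM, outPricePerM float64) float64 {
	return float64(inputTokens)/1e6*inPricePerM + float64(outputTokens)/1e6*outPricePerM
}

func main() {
	// A 2,000-token prompt with a 500-token reply at $3/$15 per million tokens.
	fmt.Printf("$%.4f per request\n", estimateCostUSD(2000, 500, 3.0, 15.0))
}
```

Run that math against your daily request volume before you ship; long prompts multiplied by traffic is where the surprise bills come from.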
Pattern 1: Prompt Templates as Code
Prompts aren’t strings you tweak in a UI. They’re code. Version them. Test them. Review them.
Here is what a basic prompt template looks like in Go:
type PromptTemplate struct {
    Name       string
    Version    string
    SystemMsg  string
    UserMsgFmt string
}

func (pt *PromptTemplate) Render(vars map[string]string) (string, string) {
    userMsg := pt.UserMsgFmt
    for k, v := range vars {
        userMsg = strings.ReplaceAll(userMsg, "{{"+k+"}}", v)
    }
    return pt.SystemMsg, userMsg
}

var SummarizeTemplate = PromptTemplate{
    Name:       "summarize-v2",
    Version:    "2.1.0",
    SystemMsg:  "You are a concise summarizer. Output only the summary, no preamble.",
    UserMsgFmt: "Summarize this text in {{max_sentences}} sentences:\n\n{{text}}",
}
The version field matters. When model behavior shifts – and it will, with provider updates – you need to know which prompt version was running. Log it with every request. Thank me later.
I run lightweight tests against a small eval set: representative inputs, expected outputs, and a set of “this must never appear” strings. Not perfect, but it catches regressions before they reach users.
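Those eval checks don't need a framework. A sketch of the shape I use, with the "must never appear" strings alongside the expected ones (the types and wording are mine, not a standard library):

```go
package main

import (
	"fmt"
	"strings"
)

// EvalCase pairs a representative input with strings the output must contain
// and strings that must never appear (boilerplate, leaked instructions, etc.).
type EvalCase struct {
	Input        string
	MustContain  []string
	MustNotMatch []string
}

// checkOutput returns an error describing the first failed expectation.
func checkOutput(c EvalCase, output string) error {
	for _, want := range c.MustContain {
		if !strings.Contains(output, want) {
			return fmt.Errorf("missing %q", want)
		}
	}
	for _, banned := range c.MustNotMatch {
		if strings.Contains(output, banned) {
			return fmt.Errorf("contains banned string %q", banned)
		}
	}
	return nil
}

func main() {
	c := EvalCase{
		Input:        "Summarize: the meeting moved to Tuesday.",
		MustContain:  []string{"Tuesday"},
		MustNotMatch: []string{"As an AI"},
	}
	// In a real eval run, the output comes from the model; stubbed here.
	fmt.Println(checkOutput(c, "The meeting moved to Tuesday."))
}
```

Run the full case set against each new prompt version before promoting it, and log failures with the template's Version field so regressions are traceable.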
Pattern 2: Structured Output Validation
When the output feeds another system, you need it to be structured and correct. Request JSON, validate against a schema, retry on failure.
type ExtractedEntity struct {
    Name       string  `json:"name" validate:"required"`
    Category   string  `json:"category" validate:"required"`
    Confidence float64 `json:"confidence" validate:"gte=0,lte=1"`
}

// validate is a package-level instance, e.g. validator.New() from
// github.com/go-playground/validator/v10.
func extractEntities(ctx context.Context, client *LLMClient, text string) ([]ExtractedEntity, error) {
    prompt := fmt.Sprintf(`Extract entities from this text. Return a valid JSON array.
Each object must have: name (string), category (string), confidence (0-1).

Text: %s`, text)

    for attempt := 0; attempt < 3; attempt++ {
        resp, err := client.Complete(ctx, prompt)
        if err != nil {
            return nil, fmt.Errorf("completion failed: %w", err)
        }

        var entities []ExtractedEntity
        if err := json.Unmarshal([]byte(resp), &entities); err != nil {
            continue // retry on parse failure
        }

        // validator's Struct only accepts structs, not slices,
        // so validate each element individually.
        valid := true
        for _, e := range entities {
            if err := validate.Struct(e); err != nil {
                valid = false
                break
            }
        }
        if !valid {
            continue // retry on validation failure
        }
        return entities, nil
    }
    return nil, fmt.Errorf("failed to extract valid entities after 3 attempts")
}
Three retries is my default. If it can’t produce valid JSON in three attempts, something is wrong with the prompt, not the retry count. I also add the schema to the prompt itself – the model needs to see what “correct” looks like.
Pattern 3: Retrieval Grounding (RAG)
This is the default pattern for anything knowledge-heavy. The model’s training data is stale and generic. Your data is current and specific. Bridge the gap with retrieval.
The flow: index your documents, retrieve relevant chunks at query time, stuff them into the context, and instruct the model to answer only from what it sees.
type RAGPipeline struct {
    Retriever DocumentRetriever
    LLM       *LLMClient
    MaxChunks int
    MaxTokens int
}

func (r *RAGPipeline) Answer(ctx context.Context, question string) (string, []Source, error) {
    chunks, err := r.Retriever.Search(ctx, question, r.MaxChunks)
    if err != nil {
        return "", nil, fmt.Errorf("retrieval failed: %w", err)
    }
    if len(chunks) == 0 {
        return "I don't have enough information to answer that.", nil, nil
    }

    // Not named "context" -- that would shadow the context package.
    docContext := formatChunks(chunks)
    prompt := fmt.Sprintf(`Answer based ONLY on the provided context.
If the context doesn't contain the answer, say so.

Context:
%s

Question: %s`, docContext, question)

    answer, err := r.LLM.Complete(ctx, prompt)
    if err != nil {
        return "", nil, fmt.Errorf("generation failed: %w", err)
    }
    sources := extractSources(chunks)
    return answer, sources, nil
}
Two things I always enforce: the “answer only from context” instruction, and returning sources alongside the answer. The first reduces hallucinations. The second makes them detectable.
Pattern 4: Tool Use With Guardrails
Tool-using models are powerful and dangerous. A model that can call your search API, query your database, or trigger workflows needs tight constraints.
type ToolDefinition struct {
    Name        string
    Description string
    Handler     func(ctx context.Context, args json.RawMessage) (string, error)
    MaxCalls    int
    Timeout     time.Duration
}

type ToolOrchestrator struct {
    Tools       map[string]ToolDefinition
    MaxSteps    int
    StepTimeout time.Duration
}

func (o *ToolOrchestrator) Execute(ctx context.Context, task string) (string, error) {
    callCounts := make(map[string]int)
    for step := 0; step < o.MaxSteps; step++ {
        toolCall, err := o.decideNextAction(ctx, task)
        if err != nil || toolCall == nil {
            break
        }
        tool, ok := o.Tools[toolCall.Name]
        if !ok {
            return "", fmt.Errorf("unknown tool: %s", toolCall.Name)
        }

        // Enforce the per-tool call budget, not just the global step cap.
        if callCounts[toolCall.Name] >= tool.MaxCalls {
            log.Printf("tool %s exhausted its call budget (%d)", toolCall.Name, tool.MaxCalls)
            break
        }
        callCounts[toolCall.Name]++

        stepCtx, cancel := context.WithTimeout(ctx, tool.Timeout)
        result, err := tool.Handler(stepCtx, toolCall.Args)
        cancel()
        if err != nil {
            // Log and continue -- don't let one tool failure kill the chain.
            log.Printf("tool %s failed at step %d: %v", toolCall.Name, step, err)
            continue
        }
        task = appendResult(task, toolCall.Name, result)
    }
    return o.generateFinalAnswer(ctx, task)
}
Key constraints: maximum steps, per-tool timeouts, and an explicit tool whitelist. I’ve seen demos where the model calls 47 tools in a loop and racks up a $200 bill. Don’t be that team. A smaller toolset with stronger constraints beats an open-ended agent every time.
Pattern 5: Caching and Fallbacks
Cache everything you can. Many prompts are repetitive, especially in search and classification workloads. A warm cache cuts both cost and latency.
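The cache key has to cover everything that changes the output: the model, the prompt version, and the rendered prompt itself. A minimal in-memory sketch; production use would want TTLs and a shared store like Redis:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
)

// PromptCache is a deliberately minimal in-memory sketch.
type PromptCache struct {
	mu      sync.RWMutex
	entries map[string]string
}

// cacheKey hashes everything that can change the output. If any component
// changes, the key changes and stale responses can't leak through.
func cacheKey(model, promptVersion, renderedPrompt string) string {
	sum := sha256.Sum256([]byte(model + "\x00" + promptVersion + "\x00" + renderedPrompt))
	return hex.EncodeToString(sum[:])
}

func (c *PromptCache) Get(key string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.entries[key]
	return v, ok
}

func (c *PromptCache) Set(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.entries == nil {
		c.entries = make(map[string]string)
	}
	c.entries[key] = value
}

func main() {
	var cache PromptCache
	key := cacheKey("model-x", "2.1.0", "Summarize this text...")
	cache.Set(key, "A short summary.")
	if v, ok := cache.Get(key); ok {
		fmt.Println(v)
	}
}
```

Including the prompt version in the key also means a prompt change naturally invalidates the cache, with no manual flush step.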
For failures, use a fallback chain:
- Try the primary model
- Fall back to a cheaper, faster model
- Fall back to a rule-based response
- Return a safe default (“I couldn’t process that request”)
Reliability matters more than perfect answers. Users forgive “I don’t know.” They don’t forgive confidently wrong.
Operating in Production
Once this is running, you need visibility. I track four signals:
- Quality on a curated eval set, run weekly. Not vibes – actual measured accuracy.
- Latency percentiles (p50 and p95) for user-facing calls.
- Cost per request, broken down by prompt size so you can spot bloat.
- Safety exceptions – anything the output filter catches.
Combine automated checks with periodic human review. This is a living system. You’ll be tuning it weekly for the first few months.
The Uncomfortable Truth
LLM integration is a new discipline. It borrows from API design, data engineering, and observability, but it’s its own thing. The teams shipping well are the ones who accepted that early and designed for the constraints instead of pretending they don’t exist.
Build for variability. Ground in real data. Validate everything that matters. And keep your fallbacks warm.