RAG Patterns That Actually Work in Production

RAG is the default architecture for grounding LLMs in private data. Here are the patterns that survive real traffic, with Go examples from production systems.

Quick take

Retrieval quality determines RAG quality. The generation model can’t fix bad retrieval. Invest in chunking, hybrid search, and a solid eval pipeline before you touch anything else. Fancy patterns – reranking, query expansion, context compression – only matter after the basics are right.

At a financial infrastructure company, we deal with documentation – ledger specs, API docs, and compliance guides – that changes frequently and lives outside any model’s training data. RAG isn’t optional for us. It’s the architecture. After building several iterations, I have strong opinions about what works and what’s a waste of time.

RAG isn’t a single pattern. It’s a family of techniques you combine based on your data shape, latency budget, and how much you care about accuracy. Here’s what I’ve learned.

Why RAG, Not Fine-Tuning

This question comes up constantly: “Why not just fine-tune the model on our data?”

Three reasons. First, model knowledge is frozen at training time, while your data keeps changing. Unless you want to retrain every week, you need retrieval. Second, fine-tuning doesn’t give you citations. RAG lets you trace every answer back to its source, which matters a lot when money is involved. Third, fine-tuning is expensive and slow. Adding a document to a retrieval index takes seconds.

RAG closes three gaps that make LLMs dangerous in production: stale knowledge, confident hallucinations, and no access to private data.

The Core Pipeline

Every RAG system has two flows: ingestion and query.

// Ingestion: prepare documents for retrieval
type IngestionPipeline struct {
    Parser    DocumentParser
    Chunker   Chunker
    Embedder  Embedder
    Store     VectorStore
}

func (p *IngestionPipeline) Ingest(ctx context.Context, doc RawDocument) error {
    parsed, err := p.Parser.Parse(ctx, doc)
    if err != nil {
        return fmt.Errorf("parse: %w", err)
    }

    chunks := p.Chunker.Chunk(parsed)

    for i, chunk := range chunks {
        embedding, err := p.Embedder.Embed(ctx, chunk.Text)
        if err != nil {
            return fmt.Errorf("embed chunk %d: %w", i, err)
        }

        err = p.Store.Upsert(ctx, VectorRecord{
            ID:        fmt.Sprintf("%s-chunk-%d", doc.ID, i),
            Vector:    embedding,
            Content:   chunk.Text,
            Metadata:  chunk.Metadata,
            SourceID:  doc.ID,
            ChunkIdx:  i,
            UpdatedAt: time.Now(),
        })
        if err != nil {
            return fmt.Errorf("store chunk %d: %w", i, err)
        }
    }

    return nil
}

// Query: retrieve context and generate
type QueryPipeline struct {
    Embedder  Embedder
    Store     VectorStore
    Reranker  Reranker // optional
    LLM       LLMClient
    MaxChunks int
}

func (p *QueryPipeline) Answer(ctx context.Context, question string) (*Response, error) {
    qEmbedding, err := p.Embedder.Embed(ctx, question)
    if err != nil {
        return nil, fmt.Errorf("embed query: %w", err)
    }

    candidates, err := p.Store.Search(ctx, qEmbedding, p.MaxChunks*3)
    if err != nil {
        return nil, fmt.Errorf("search: %w", err)
    }

    if len(candidates) == 0 {
        return &Response{
            Answer:  "I don't have enough information to answer that question.",
            Sources: nil,
        }, nil
    }

    // Optional reranking pass
    if p.Reranker != nil {
        reranked, err := p.Reranker.Rerank(ctx, question, candidates)
        if err != nil {
            // Log but don't fail -- keep the original ranking.
            // Assigning the failed result directly to candidates would
            // clobber it with nil, so use a separate variable.
            log.Printf("rerank failed, using original order: %v", err)
        } else {
            candidates = reranked
        }
    }

    // Take top chunks
    if len(candidates) > p.MaxChunks {
        candidates = candidates[:p.MaxChunks]
    }

    context := formatContext(candidates)
    prompt := buildRAGPrompt(question, context)

    answer, err := p.LLM.Complete(ctx, prompt)
    if err != nil {
        return nil, fmt.Errorf("generate: %w", err)
    }

    return &Response{
        Answer:  answer,
        Sources: extractSources(candidates),
    }, nil
}

This looks simple on paper. Most production work is in the details.

Pattern 1: Get Retrieval Right First

This is the single most important thing I can tell you: if retrieval is bad, generation won’t save you. The model can only work with what it sees. Feed it irrelevant chunks and you get irrelevant answers with high confidence. That’s worse than no answer.

Chunking is where most teams fail. Chunks that are too small lose context. Chunks that are too large dilute relevance. I default to 256-512 tokens with 10-15% overlap at sentence boundaries. But the right size depends on your content. Technical documentation with dense terminology needs smaller chunks. Narrative content can go larger.

Metadata makes or breaks filtering. Every chunk needs a source document ID, position index, timestamp, and whatever access control labels your system requires. At a financial infrastructure company, we tag chunks with document type (API docs vs. guides vs. changelogs) so we can filter at query time. A user asking about the ledger API shouldn’t get results from the changelog.

Hybrid search is almost always better than pure vector search. Vector search handles semantic similarity well. It handles exact matches terribly. When someone searches for a specific transaction ID or an error code, you need keyword matching. Combine both:

type HybridSearcher struct {
    Vector  VectorStore
    Keyword KeywordIndex
    Alpha   float64 // 0 = keyword only, 1 = vector only
}

func (h *HybridSearcher) Search(ctx context.Context, query string, embedding []float64, limit int) ([]SearchResult, error) {
    vectorResults, err := h.Vector.Search(ctx, embedding, limit*2)
    if err != nil {
        return nil, err
    }

    keywordResults, err := h.Keyword.Search(ctx, query, limit*2)
    if err != nil {
        return nil, err
    }

    return reciprocalRankFusion(vectorResults, keywordResults, h.Alpha, limit), nil
}

Start with Alpha = 0.7 (favoring semantic search) and adjust based on your eval results.
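
The reciprocalRankFusion call above can be sketched as a weighted variant of RRF. One hedge: classic RRF has no alpha – weighting each list's contribution is an extension here, and k = 60 is the conventional constant from the original RRF formulation:

```go
package main

import (
	"fmt"
	"sort"
)

// SearchResult is a minimal stand-in for the retrieval result type.
type SearchResult struct {
	ID    string
	Score float64
}

// reciprocalRankFusion merges two ranked lists. Each appearance
// contributes weight / (k + rank); alpha weights the vector list and
// (1 - alpha) the keyword list, so items ranked well in both rise.
func reciprocalRankFusion(vector, keyword []SearchResult, alpha float64, limit int) []SearchResult {
	const k = 60.0
	scores := map[string]float64{}
	for rank, r := range vector {
		scores[r.ID] += alpha / (k + float64(rank+1))
	}
	for rank, r := range keyword {
		scores[r.ID] += (1 - alpha) / (k + float64(rank+1))
	}
	fused := make([]SearchResult, 0, len(scores))
	for id, s := range scores {
		fused = append(fused, SearchResult{ID: id, Score: s})
	}
	sort.Slice(fused, func(i, j int) bool { return fused[i].Score > fused[j].Score })
	if len(fused) > limit {
		fused = fused[:limit]
	}
	return fused
}

func main() {
	v := []SearchResult{{ID: "a"}, {ID: "b"}}
	kw := []SearchResult{{ID: "b"}, {ID: "c"}}
	for _, r := range reciprocalRankFusion(v, kw, 0.7, 3) {
		fmt.Println(r.ID) // "b" wins: it appears in both lists
	}
}
```

Rank-based fusion sidesteps the fact that vector similarities and keyword scores live on incomparable scales, which is why it beats naive score averaging.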

Pattern 2: Query Shaping

Users don’t write queries optimized for search. “How do I fix the thing that broke yesterday” is a real query. Your retrieval system needs to handle that.

Query shaping rewrites the user question into a form that retrieves better results. Three approaches:

Query rewriting: Use the LLM to rephrase the question into search-friendly terms. Cheap and surprisingly effective.

Query expansion: Generate multiple variants of the question and search with all of them. Increases recall at the cost of latency.

Hypothetical document embedding (HyDE): Generate a short hypothetical answer, embed that, and use it as the search query. This works because the hypothetical answer is in the same semantic space as the real documents. Clever, and it works better than it should.

Measure everything. Track whether query shaping increases recall without degrading precision. If it adds noise, revert it. I’ve seen query expansion improve results on one corpus and destroy them on another.

Pattern 3: Reranking

Initial retrieval is a fast, coarse filter. Reranking is a slower, more precise second pass. You retrieve 3x more candidates than you need, then use a cross-encoder or stronger model to reorder them.

This is worth doing when your corpus is large and diverse, or when precision matters more than speed. For our documentation, reranking improved answer quality noticeably because initial retrieval would sometimes rank a vaguely related API doc above the directly relevant one.

Treat reranking as a knob. Turn it on for high-stakes queries. Skip it when speed matters more.

Pattern 4: Context Window Management

Models have finite context windows. Your documents are verbose. Sending everything is wasteful and sometimes counterproductive – the model can get confused by too much context.

Select aggressively. Extract only the sentences or paragraphs tied to the question. Summarize long documents into focused briefs before placing them in the prompt.

But always keep pointers to the original sources. If a user can’t verify where an answer came from, the system isn’t trustworthy. At that company, every answer includes source references. Non-negotiable for anything involving financial data.

Pattern 5: Multi-Index Routing

Not all data is the same. API documentation, user guides, support tickets, and changelogs all behave differently under the same retrieval strategy. A single index forces compromises that hurt quality on everything.

Route queries to the appropriate index based on the question type. This can be simple:

func routeQuery(question string, metadata map[string]string) string {
    if containsCodeTerms(question) {
        return "api-docs"
    }
    if containsErrorTerms(question) {
        return "troubleshooting"
    }
    return "general"
}

Rule-based routing is fine to start. Evolve toward a learned router only when you have enough data to train one.

Evaluation: The Part Nobody Wants To Do

RAG fails in subtle ways. Retrieval looks fine but the answer misses a critical detail. The sources are correct but the generation misinterprets them. You won’t catch these with unit tests.

Build an eval set. At minimum:

  • 50+ queries with known correct answers and expected source documents
  • Retrieval recall: did the right chunks come back?
  • Answer faithfulness: does the answer match the retrieved context?
  • User feedback: track thumbs up/down, re-prompts, and abandonment
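
Retrieval recall is cheap to compute once each eval query has a set of expected chunk IDs. A minimal sketch, assuming chunk IDs follow the ingestion pipeline's naming scheme:

```go
package main

import "fmt"

// recallAtK returns the fraction of expected chunk IDs present in the
// retrieved top-k results for a single eval query.
func recallAtK(expected, retrieved []string) float64 {
	if len(expected) == 0 {
		return 1.0
	}
	got := map[string]bool{}
	for _, id := range retrieved {
		got[id] = true
	}
	hits := 0
	for _, id := range expected {
		if got[id] {
			hits++
		}
	}
	return float64(hits) / float64(len(expected))
}

func main() {
	// One of the two expected chunks came back: recall 0.5.
	fmt.Println(recallAtK([]string{"doc-1-chunk-0", "doc-1-chunk-3"}, []string{"doc-1-chunk-0", "doc-2-chunk-1"}))
}
```

Average this across the eval set and track the number per pipeline revision; a drop after a chunking or embedding change is your regression signal.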

Run this eval on every change to the pipeline. Chunking strategy change? Run eval. New embedding model? Run eval. Prompt update? Run eval.

Pair offline eval with production monitoring. Watch for queries with zero results, answers that cite no sources, and latency spikes. These are your early warnings.

The Boring But Essential Stuff

Cache frequent queries. If staleness is acceptable (and for documentation it usually is), cache aggressively.

Build clear fallback paths. When retrieval returns nothing relevant, say so. “I don’t have information about that” is better than a hallucinated answer.

Log everything: the query, the retrieved chunks, the generated answer, and the sources. You’ll need this for debugging and for improving the system.

None of this is exciting. All of it is what keeps the system running under real traffic.

Start Simple

RAG is a toolkit, not a checkbox. Start with a clean ingestion pipeline and basic vector search. Measure retrieval quality. Add hybrid search when you see exact-match failures. Add reranking when precision matters. Add query shaping when user queries are consistently poor.

The best production RAG systems I’ve seen are the ones that resisted complexity until the data demanded it.