Quick take
Stop blaming the LLM. If your RAG system gives bad answers, the retrieval is almost certainly the bottleneck. Hybrid search, proper chunking, query expansion, and reranking – measured separately from generation – will do more for answer quality than any prompt engineering trick.
I’ve built three different RAG systems this year, and each time the first complaint was “the model hallucinates.” Each time, the real problem was retrieval feeding garbage into context. The model was doing its best with bad evidence.
Basic RAG – embed the query, grab the top-k chunks, stuff them into the prompt – is a fragile baseline. It works in demos. It breaks on real data. Here’s why, and what to do about it.
Why Basic Retrieval Fails
The failure modes are predictable. I see the same ones everywhere:
Vocabulary mismatch. The user asks about “cancellation policy” but the source document says “termination terms.” Pure semantic search sometimes bridges this gap. Sometimes it doesn’t.
Context fragmentation. A paragraph that answers the question gets split across two chunks. Neither chunk scores high enough on its own. The answer exists in your corpus but the retrieval never finds it.
Wrong granularity. Your chunks are 512 tokens. The user asks a question that needs a 50-token fact buried in the middle. The surrounding noise tanks the relevance score.
Temporal confusion. The 2022 policy and the 2024 policy both match the query. The retrieval returns whichever embeds closer, not whichever is current.
Multi-hop requirements. The answer requires combining facts from two different documents. Single-query retrieval will find one, maybe. Not both.
Hybrid Search: Combine Signals
Pure vector search misses exact terms. Pure lexical search misses paraphrases. Combine them.
The implementation is straightforward. Run both searches, normalize the scores, and fuse the rankings. Reciprocal Rank Fusion (RRF) is the simplest approach that works:
```go
package search

import "sort"

// RRFMerge combines results from multiple search backends using
// Reciprocal Rank Fusion. k controls how much rank position
// matters -- 60 is a common default.
func RRFMerge(results [][]Result, k float64) []Result {
	scores := make(map[string]float64)
	docs := make(map[string]Result)
	for _, ranked := range results {
		for rank, r := range ranked {
			scores[r.ID] += 1.0 / (k + float64(rank+1))
			docs[r.ID] = r
		}
	}
	merged := make([]Result, 0, len(scores))
	for id, score := range scores {
		doc := docs[id]
		doc.Score = score
		merged = append(merged, doc)
	}
	sort.Slice(merged, func(i, j int) bool {
		return merged[i].Score > merged[j].Score
	})
	return merged
}
```
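To make the fusion concrete, here's the function exercised end to end. The `Result` type is reconstructed from the two fields the function touches (`ID`, `Score`); a real system would carry more:

```go
package main

import (
	"fmt"
	"sort"
)

// Result is the minimal shape RRFMerge needs: an ID and a score.
type Result struct {
	ID    string
	Score float64
}

// RRFMerge as defined above, repeated so this sketch runs standalone.
func RRFMerge(results [][]Result, k float64) []Result {
	scores := make(map[string]float64)
	docs := make(map[string]Result)
	for _, ranked := range results {
		for rank, r := range ranked {
			scores[r.ID] += 1.0 / (k + float64(rank+1))
			docs[r.ID] = r
		}
	}
	merged := make([]Result, 0, len(scores))
	for id, score := range scores {
		doc := docs[id]
		doc.Score = score
		merged = append(merged, doc)
	}
	sort.Slice(merged, func(i, j int) bool {
		return merged[i].Score > merged[j].Score
	})
	return merged
}

func main() {
	vector := []Result{{ID: "A"}, {ID: "B"}, {ID: "C"}} // vector ranking
	lexical := []Result{{ID: "A"}, {ID: "B"}}           // lexical ranking
	for _, r := range RRFMerge([][]Result{vector, lexical}, 60) {
		fmt.Printf("%s %.4f\n", r.ID, r.Score)
	}
}
```

Note that a document ranked second by both backends ("B") still beats one ranked third by only one of them ("C") -- rank agreement across signals is exactly what RRF rewards.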
From what I’ve seen, hybrid search with RRF improves recall by 15-30% over pure vector search on real corpora. Not synthetic benchmarks – real production data with messy, inconsistent documents.
Chunking Isn’t a Formatting Detail
Most teams treat chunking as a config parameter. Set chunk_size=512, done. This is wrong.
Good chunking preserves the structure of the source material. If your documents have headings, keep them. If a section is self-contained, chunk it as a unit. If a chunk can’t be understood without its parent heading, prepend a breadcrumb.
```go
// Chunk represents a document fragment with enough context
// to be understood when retrieved independently.
type Chunk struct {
	ID         string
	Content    string
	Breadcrumb string // e.g. "Policy Manual > Section 4 > Termination"
	Source     string
	UpdatedAt  time.Time
	Tokens     int
}

// ChunkWithContext prepends the breadcrumb to the content so the
// chunk is self-contained when injected into a prompt.
func (c Chunk) ChunkWithContext() string {
	if c.Breadcrumb == "" {
		return c.Content
	}
	return fmt.Sprintf("[%s]\n\n%s", c.Breadcrumb, c.Content)
}
```
The breadcrumb costs a few tokens per chunk. It pays for itself by making the model understand what it’s reading. Without it, the model gets a floating paragraph with no context about where it came from.
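What structure-aware chunking looks like in practice depends on the corpus, but a toy sketch gets the idea across. This one splits on markdown-style `## ` headings and records the breadcrumb as it goes (the `SplitByHeadings` name and the trimmed-down `Chunk` are mine; a real chunker would also enforce token budgets and handle nesting):

```go
package main

import (
	"fmt"
	"strings"
)

// Chunk, trimmed to the two fields this sketch needs.
type Chunk struct {
	Breadcrumb string
	Content    string
}

// SplitByHeadings is a toy structure-aware chunker: each "## " line
// starts a new chunk, and the breadcrumb records where it came from.
func SplitByHeadings(doc, title string) []Chunk {
	var chunks []Chunk
	cur := -1 // index of the chunk currently being filled
	for _, line := range strings.Split(doc, "\n") {
		if h, ok := strings.CutPrefix(line, "## "); ok {
			chunks = append(chunks, Chunk{Breadcrumb: title + " > " + h})
			cur = len(chunks) - 1
		} else if cur >= 0 {
			chunks[cur].Content += line + "\n"
		}
	}
	return chunks
}

func main() {
	doc := "## Termination\nEither party may cancel with 30 days notice.\n## Fees\nBilled monthly."
	for _, c := range SplitByHeadings(doc, "Policy Manual") {
		fmt.Printf("[%s] %s", c.Breadcrumb, c.Content)
	}
}
```

Each emitted chunk now carries "Policy Manual > Termination"-style context, which is exactly what `ChunkWithContext` injects into the prompt later.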
Query Expansion
Single-shot queries are narrow. The user types one phrasing, but the relevant document uses different words. You miss.
Query expansion generates alternative phrasings and retrieves against all of them. The simplest version that works: ask the LLM to generate 2-3 reformulations, then run all queries and merge results.
A more interesting approach is HyDE (Hypothetical Document Embeddings). Instead of expanding the query, generate a hypothetical answer and embed that. The intuition is that a hypothetical answer is closer in embedding space to the actual answer than the question is.
```go
// ExpandQuery generates alternative phrasings for retrieval.
// Returns the original query plus expansions.
func ExpandQuery(ctx context.Context, llm LLM, query string, n int) ([]string, error) {
	prompt := fmt.Sprintf(
		"Generate %d alternative phrasings of this search query. "+
			"Return only the queries, one per line.\n\nQuery: %s",
		n, query,
	)
	resp, err := llm.Complete(ctx, prompt)
	if err != nil {
		// Fallback: just use the original query.
		return []string{query}, nil
	}
	queries := []string{query}
	for _, line := range strings.Split(resp, "\n") {
		line = strings.TrimSpace(line)
		if line != "" {
			queries = append(queries, line)
		}
	}
	return queries, nil
}
```
Note the error handling: if expansion fails, fall back to the original query. Don’t let a retrieval enhancement become a retrieval blocker.
Expansion increases recall, but it also brings in noise. That’s fine, because the next step handles it.
Reranking: The Cleanup Step
After gathering candidates from hybrid search across expanded queries, you have a broad set. Some of it is relevant. Much of it isn't. A reranker fixes the ordering.
A cross-encoder reranker compares the full query against the full chunk text. It’s slower than embedding similarity but significantly more accurate for the final ranking. Run it on your top 20-50 candidates, not your entire corpus.
```go
// Rerank takes candidate chunks and reorders them by relevance
// using a cross-encoder model. Keep topN results.
func Rerank(ctx context.Context, model Reranker, query string, candidates []Chunk, topN int) ([]Chunk, error) {
	type scored struct {
		chunk Chunk
		score float64
	}
	// Clamp topN so the fallback below can't slice past the end.
	if topN > len(candidates) {
		topN = len(candidates)
	}
	pairs := make([]QueryDocPair, len(candidates))
	for i, c := range candidates {
		pairs[i] = QueryDocPair{Query: query, Document: c.ChunkWithContext()}
	}
	scores, err := model.Score(ctx, pairs)
	if err != nil {
		return candidates[:topN], nil // degrade gracefully
	}
	ranked := make([]scored, len(candidates))
	for i := range candidates {
		ranked[i] = scored{chunk: candidates[i], score: scores[i]}
	}
	sort.Slice(ranked, func(i, j int) bool {
		return ranked[i].score > ranked[j].score
	})
	result := make([]Chunk, 0, topN)
	for i := 0; i < topN; i++ {
		result = append(result, ranked[i].chunk)
	}
	return result, nil
}
```
Again, graceful degradation. If the reranker fails, return the original order truncated to topN. The system should always return something useful.
Multi-Representation Indexing
One embedding per document leaves retrieval quality on the table. For important documents, index multiple representations:
- The full text (for detail queries)
- A concise summary (for broad queries)
- Question-like phrasings that the text answers (for direct questions)
This widens the retrieval surface without changing the source documents. It’s extra indexing work, but the recall improvement on multi-hop queries is substantial. I’ve seen it close the gap on questions that basic retrieval missed entirely.
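The indexing side is simple in outline. A sketch, with illustrative names (`IndexEntry`, `Representations` are mine) -- in practice the summary and questions would come from an offline LLM pass, not be hand-written:

```go
package main

import "fmt"

// IndexEntry is one retrievable representation pointing back to
// its source document.
type IndexEntry struct {
	DocID string
	Kind  string // "full", "summary", or "question"
	Text  string
}

// Representations builds the entries to index for one document:
// the full text, a concise summary, and question-like phrasings.
// All of them retrieve the same underlying document.
func Representations(docID, full, summary string, questions []string) []IndexEntry {
	entries := []IndexEntry{
		{DocID: docID, Kind: "full", Text: full},
		{DocID: docID, Kind: "summary", Text: summary},
	}
	for _, q := range questions {
		entries = append(entries, IndexEntry{DocID: docID, Kind: "question", Text: q})
	}
	return entries
}

func main() {
	entries := Representations("policy-4",
		"Either party may terminate with 30 days written notice...",
		"Termination terms for the service agreement.",
		[]string{"How do I cancel?", "What is the notice period?"})
	for _, e := range entries {
		fmt.Println(e.Kind, "->", e.DocID)
	}
}
```

At query time, a hit on any representation resolves to the source document, so the user-facing context is unchanged; only the retrieval surface grows.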
Measure Retrieval Separately
This is the part most teams skip, and it’s the most important.
If you only measure end-to-end answer quality, you can’t tell whether a bad answer came from bad retrieval or bad generation. You need retrieval-specific metrics:
- Recall@k: Did the relevant chunk appear in the top k results?
- Precision@k: What fraction of the top k results were actually relevant?
- MRR (Mean Reciprocal Rank): How high did the first relevant result rank?
- nDCG: How well-ordered is the full ranking?
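The first two metrics are a few lines each. A minimal sketch (function names are mine):

```go
package main

import "fmt"

// RecallAtK: what fraction of the relevant docs appear in the top k.
func RecallAtK(retrieved []string, relevant map[string]bool, k int) float64 {
	if k > len(retrieved) {
		k = len(retrieved)
	}
	hits := 0
	for _, id := range retrieved[:k] {
		if relevant[id] {
			hits++
		}
	}
	return float64(hits) / float64(len(relevant))
}

// MRR: average over queries of 1/rank of the first relevant result
// (contributing 0 when nothing relevant was retrieved).
func MRR(retrievedPerQuery [][]string, relevantPerQuery []map[string]bool) float64 {
	var sum float64
	for i, retrieved := range retrievedPerQuery {
		for rank, id := range retrieved {
			if relevantPerQuery[i][id] {
				sum += 1.0 / float64(rank+1)
				break
			}
		}
	}
	return sum / float64(len(retrievedPerQuery))
}

func main() {
	relevant := map[string]bool{"A": true, "C": true}
	// "A" found in the top 2, "C" missed: recall@2 = 0.5.
	fmt.Println(RecallAtK([]string{"A", "B", "C"}, relevant, 2))
}
```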
Build a small eval set – 50 to 100 query-document pairs where you know which chunks should be retrieved. Run it after every change to chunking, embedding, or search logic. This is the single highest-leverage investment in a RAG system.
I keep these eval sets in the repo alongside the retrieval code. They’re as important as unit tests. Maybe more important.
The Full Pipeline
Putting it all together, the retrieval pipeline for a production RAG system looks like:
- Expand the query (2-3 reformulations)
- Run hybrid search (vector + lexical) for each query variant
- Merge results with RRF
- Rerank the merged candidates
- Return top-k chunks with breadcrumbs
Each step is independently testable and independently measurable. When something breaks, you know where to look.
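That testability falls out naturally if each stage is a swappable function. A sketch of the wiring, with illustrative signatures (IDs stand in for chunks to keep it short):

```go
package main

import (
	"fmt"
	"sort"
)

// Function-typed stages so each step can be swapped and tested
// independently.
type (
	ExpandFn func(query string) []string // query -> variants (incl. original)
	SearchFn func(query string) []string // query -> ranked doc IDs
	RerankFn func(query string, ids []string) []string
)

// Retrieve runs the full pipeline: expand the query, search each
// variant with each backend, RRF-merge the rankings, rerank, truncate.
func Retrieve(expand ExpandFn, backends []SearchFn, rerank RerankFn, query string, topK int) []string {
	scores := map[string]float64{}
	for _, q := range expand(query) {
		for _, search := range backends {
			for rank, id := range search(q) {
				scores[id] += 1.0 / (60 + float64(rank+1)) // RRF, k=60
			}
		}
	}
	merged := make([]string, 0, len(scores))
	for id := range scores {
		merged = append(merged, id)
	}
	sort.Slice(merged, func(i, j int) bool { return scores[merged[i]] > scores[merged[j]] })
	ranked := rerank(query, merged)
	if topK > len(ranked) {
		topK = len(ranked)
	}
	return ranked[:topK]
}

func main() {
	// Trivial stand-ins: no expansion, one backend, identity reranker.
	noExpand := func(q string) []string { return []string{q} }
	search := func(q string) []string { return []string{"doc1", "doc2"} }
	identity := func(q string, ids []string) []string { return ids }
	fmt.Println(Retrieve(noExpand, []SearchFn{search}, identity, "cancellation policy", 1))
}
```

Swapping in the real `ExpandQuery`, hybrid backends, and `Rerank` from the sections above is a matter of adapting signatures; the orchestration itself stays this small.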
The generation step is almost an afterthought once retrieval is solid. A decent model with the right evidence in context will give you a good answer. A frontier model with the wrong evidence will confidently give you a wrong one.
Fix retrieval first. Everything else follows.