Building Semantic Search in Go: From Embeddings to Production


A hands-on walkthrough of building semantic search with Go, OpenAI embeddings, and pgvector. Includes chunking strategies, hybrid retrieval, and the gotchas I hit along the way.

Quick take

Semantic search isn’t hard to build. It’s hard to build well. The difference is in chunking, hybrid retrieval, and having an eval set before you ship – not in which vector database you picked.

I built a semantic search system last month for a documentation corpus. About 15,000 pages of technical docs, financial API references, and internal knowledge base articles. The old keyword search was painful – users searching for “how to reverse a payment” would get zero results because the docs called it “transaction reversal” or “credit adjustment.” Classic vocabulary mismatch.

Semantic search fixes this. Here’s how I built it, what worked, and what I’d do differently.

The architecture

Nothing exotic. The pipeline looks like this:

Documents -> chunk -> embed (OpenAI) -> store (pgvector)
Query -> embed -> vector search + keyword search -> merge & rank -> return

I chose pgvector over a dedicated vector database because the project already used PostgreSQL. One fewer service to operate. The performance has been fine for our scale – sub-50ms queries at 500K vectors. If you’re at millions of vectors with hard latency requirements, Pinecone or Weaviate might make more sense. But don’t start there.

Chunking: where most people get it wrong

Chunking strategy has more impact on search quality than model choice. I learned this the hard way at the fintech startup years ago when we were building news search – the same principle applies to embeddings.

The naive approach is fixed-size chunks (500 tokens, overlap 50). It’s easy to implement and mediocre at everything. The problem: a chunk that starts mid-paragraph and ends mid-sentence creates an embedding that represents… nothing coherent.
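For concreteness, the fixed-size approach is roughly the following sketch. chunkFixed is a hypothetical helper, and words stand in for tokens to keep it dependency-free; the boundary problem is the same either way.

```go
package main

import (
	"fmt"
	"strings"
)

// chunkFixed splits text into windows of `size` units with `overlap` units
// shared between neighbors. Note it cuts wherever the window happens to end,
// with no regard for sentence or paragraph boundaries.
func chunkFixed(text string, size, overlap int) []string {
	words := strings.Fields(text)
	step := size - overlap
	var chunks []string
	for start := 0; start < len(words); start += step {
		end := start + size
		if end > len(words) {
			end = len(words)
		}
		chunks = append(chunks, strings.Join(words[start:end], " "))
		if end == len(words) {
			break
		}
	}
	return chunks
}

func main() {
	for _, c := range chunkFixed("one two three four five six seven", 3, 1) {
		fmt.Println(c)
	}
}
```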

What worked better:

func chunkDocument(doc Document) []Chunk {
    sections := splitByHeadings(doc.Content)
    var chunks []Chunk

    for _, section := range sections {
        if tokenCount(section.Text) <= maxChunkTokens {
            chunks = append(chunks, Chunk{
                Text:     section.Text,
                Title:    section.Heading,
                Source:   doc.URL,
                Section:  section.Heading,
            })
            continue
        }

        // Split long sections by paragraph, keeping heading as context
        paragraphs := splitByParagraph(section.Text)
        for _, para := range paragraphs {
            if tokenCount(para) < minChunkTokens {
                continue // Skip tiny fragments
            }
            chunks = append(chunks, Chunk{
                Text:     section.Heading + "\n\n" + para,
                Title:    section.Heading,
                Source:   doc.URL,
                Section:  section.Heading,
            })
        }
    }
    return chunks
}

Key decisions here:

  • Split on document structure first (headings), then on paragraphs. Never mid-sentence.
  • Prepend the section heading to each chunk. This gives the embedding model context about what the paragraph is about. A paragraph saying “To do this, call the /refund endpoint” makes much more sense when preceded by “Processing Refunds.”
  • Skip tiny chunks. Fragments under ~50 tokens produce noisy embeddings. Either merge them with adjacent content or drop them.
  • Store metadata. Source URL, section heading, document title. You’ll need all of these for display, filtering, and debugging.

I tested three chunk sizes: 200, 400, and 800 tokens. 400 hit the sweet spot for our content. Smaller chunks had better precision but worse context. Larger chunks had more context but diluted the signal. Test this with your own content – the right size depends on document structure.
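The tokenCount helper in the chunker above can be a real tokenizer port, but for chunk-size decisions a cheap approximation is often enough. A sketch using the common rule of thumb of roughly four characters per token for English text; swap in a real tokenizer if chunk boundaries must be exact:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// tokenCount approximates the embedding model's tokenizer with the
// ~4-characters-per-token heuristic for English prose. Good enough for
// deciding whether a section fits in a chunk; not exact.
func tokenCount(s string) int {
	n := utf8.RuneCountInString(s)
	return (n + 3) / 4 // round up so any non-empty string counts as >= 1 token
}

func main() {
	fmt.Println(tokenCount("Processing refunds via the /refund endpoint."))
}
```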

Embedding and indexing

OpenAI’s text-embedding-ada-002 for now. It’s cheap ($0.0001/1K tokens), 1536 dimensions, and good enough for English technical content. I batch embed during ingestion:

func embedChunks(ctx context.Context, client *openai.Client, chunks []Chunk) error {
    const batchSize = 100
    for i := 0; i < len(chunks); i += batchSize {
        end := min(i+batchSize, len(chunks))
        batch := chunks[i:end]

        texts := make([]string, len(batch))
        for j, c := range batch {
            texts[j] = c.Text
        }

        resp, err := client.CreateEmbeddings(ctx, openai.EmbeddingRequest{
            Model: openai.AdaEmbeddingV2,
            Input: texts,
        })
        if err != nil {
            return fmt.Errorf("embedding batch %d: %w", i/batchSize, err)
        }

        for j, embedding := range resp.Data {
            batch[j].Vector = embedding.Embedding
        }

        if err := storeChunks(ctx, batch); err != nil {
            return fmt.Errorf("storing batch %d: %w", i/batchSize, err)
        }
    }
    return nil
}

The pgvector schema:

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunks (
    id          BIGSERIAL PRIMARY KEY,
    doc_url     TEXT NOT NULL,
    section     TEXT NOT NULL,
    content     TEXT NOT NULL,
    embedding   vector(1536) NOT NULL,
    created_at  TIMESTAMPTZ DEFAULT now()
);

CREATE INDEX ON chunks USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

IVFFlat index with 100 lists works well for our dataset size. For larger collections, HNSW gives better recall at the cost of more memory. The pgvector docs have good guidance on when to switch.
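storeChunks from the ingestion code isn't shown above. With pgx, the only wrinkle is getting a []float32 into the vector column: the pgvector-go package provides a type for this, or you can build pgvector's text literal by hand. A sketch of the hand-rolled route (formatVector is my name, not a pgvector API):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// formatVector renders a []float32 as pgvector's text literal, e.g.
// "[0.1,0.2,0.3]", which Postgres casts with ::vector.
func formatVector(v []float32) string {
	var b strings.Builder
	b.WriteByte('[')
	for i, f := range v {
		if i > 0 {
			b.WriteByte(',')
		}
		// bitSize 32 keeps the shortest representation that round-trips a float32
		b.WriteString(strconv.FormatFloat(float64(f), 'f', -1, 32))
	}
	b.WriteByte(']')
	return b.String()
}

func main() {
	fmt.Println(formatVector([]float32{0.1, 0.2, 0.3}))
}
```

The insert then casts the literal, along the lines of INSERT INTO chunks (doc_url, section, content, embedding) VALUES ($1, $2, $3, $4::vector), passing formatVector(c.Vector) as the fourth argument.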

Hybrid retrieval: the part most tutorials skip

Pure vector search has a problem. It’s great at “how do I reverse a payment” -> “transaction reversal.” It’s terrible at “error code FIN-4032” -> the page about that specific error code. Exact matches get lost in the embedding space because semantically similar concepts crowd them out.

The fix is hybrid retrieval: combine vector similarity with keyword matching.

func hybridSearch(ctx context.Context, db *pgxpool.Pool, query string, limit int) ([]Result, error) {
    queryVec, err := embed(ctx, query)
    if err != nil {
        return nil, err
    }

    rows, err := db.Query(ctx, `
        WITH vector_results AS (
            SELECT id, content, doc_url, section,
                   1 - (embedding <=> $1::vector) AS vector_score
            FROM chunks
            ORDER BY embedding <=> $1::vector
            LIMIT $2 * 3
        ),
        keyword_results AS (
            SELECT id, content, doc_url, section,
                   ts_rank(to_tsvector('english', content),
                           plainto_tsquery('english', $3)) AS keyword_score
            FROM chunks
            WHERE to_tsvector('english', content) @@ plainto_tsquery('english', $3)
            ORDER BY keyword_score DESC -- without this, LIMIT keeps arbitrary matches
            LIMIT $2 * 3
        )
        SELECT COALESCE(v.id, k.id) AS id,
               COALESCE(v.content, k.content) AS content,
               COALESCE(v.doc_url, k.doc_url) AS doc_url,
               COALESCE(v.section, k.section) AS section,
               COALESCE(v.vector_score, 0) * 0.7
                 + COALESCE(k.keyword_score, 0) * 0.3 AS combined_score
        FROM vector_results v
        FULL OUTER JOIN keyword_results k ON v.id = k.id
        ORDER BY combined_score DESC
        LIMIT $2
    `, queryVec, limit, query)
    if err != nil {
        return nil, err
    }
    defer rows.Close()

    return scanResults(rows)
}

The 0.7/0.3 weighting between vector and keyword scores is a starting point. I tuned it against our eval set and landed on 0.65/0.35 for our specific content. The important thing is having both signals. Pure vector search gave us 72% precision@5. Hybrid pushed it to 89%.

Evaluation: do this first, not last

I know I’m putting this section after the implementation code. Don’t build it in this order. Build the eval set first.

Our eval set is 150 query-document pairs. I pulled 100 from actual search logs (what people searched for and which doc they ended up reading) and wrote 50 more to cover edge cases: typos, abbreviations, multi-language queries, and questions that should return nothing.

type EvalCase struct {
    Query       string   `json:"query"`
    RelevantIDs []string `json:"relevant_ids"`
    Notes       string   `json:"notes"`
}

func runEval(ctx context.Context, cases []EvalCase) EvalReport {
    var report EvalReport
    for _, c := range cases {
        results, err := hybridSearch(ctx, db, c.Query, 5)
        if err != nil {
            log.Printf("eval query %q: %v", c.Query, err)
            continue
        }
        resultIDs := extractIDs(results)

        report.Cases = append(report.Cases, CaseResult{
            Query:      c.Query,
            Precision:  precision(resultIDs, c.RelevantIDs),
            Recall:     recall(resultIDs, c.RelevantIDs),
            MRR:        meanReciprocalRank(resultIDs, c.RelevantIDs),
        })
    }
    return report
}
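The metric helpers referenced above are small. A sketch of all three, treating relevance as exact membership in RelevantIDs:

```go
package main

import "fmt"

func contains(ids []string, id string) bool {
	for _, x := range ids {
		if x == id {
			return true
		}
	}
	return false
}

// precision: fraction of returned results that are relevant.
func precision(results, relevant []string) float64 {
	if len(results) == 0 {
		return 0
	}
	hits := 0
	for _, id := range results {
		if contains(relevant, id) {
			hits++
		}
	}
	return float64(hits) / float64(len(results))
}

// recall: fraction of relevant documents that were returned.
func recall(results, relevant []string) float64 {
	if len(relevant) == 0 {
		return 0
	}
	hits := 0
	for _, id := range relevant {
		if contains(results, id) {
			hits++
		}
	}
	return float64(hits) / float64(len(relevant))
}

// meanReciprocalRank: 1/rank of the first relevant result, 0 if none appears.
func meanReciprocalRank(results, relevant []string) float64 {
	for i, id := range results {
		if contains(relevant, id) {
			return 1.0 / float64(i+1)
		}
	}
	return 0
}

func main() {
	results := []string{"a", "b", "c"}
	relevant := []string{"b", "d"}
	fmt.Println(precision(results, relevant), recall(results, relevant), meanReciprocalRank(results, relevant))
}
```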

Run this on every change: different chunk sizes, embedding models, ranking weights, filter logic. Without it you’re optimizing by vibes.

What I’d do differently

Start with hybrid from day one. I spent a week on pure vector search, hit the exact-match problem, and had to retrofit keyword scoring. Should have started hybrid.

Invest in chunk quality earlier. My first pass used fixed-size chunks and the results were mediocre. I spent more time debugging “why did this irrelevant result show up” than I would have spent writing a proper chunker.

Cache embeddings for common queries. We get a lot of repeat queries. Caching the query embedding saves an API call and ~200ms per request. Obvious in retrospect.
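A minimal version of that cache: a mutex-guarded map keyed on the normalized query, with the embedding call injected so it's easy to stub. This is a sketch; a real deployment would want an LRU bound and a TTL so the map doesn't grow without limit.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// embedCache memoizes query embeddings so repeat queries skip the API call.
type embedCache struct {
	mu    sync.Mutex
	store map[string][]float32
	embed func(string) ([]float32, error) // the real OpenAI call goes here
}

func (c *embedCache) Get(query string) ([]float32, error) {
	key := strings.ToLower(strings.TrimSpace(query))

	c.mu.Lock()
	if v, ok := c.store[key]; ok {
		c.mu.Unlock()
		return v, nil
	}
	c.mu.Unlock()

	// Embed outside the lock: occasional duplicate work beats blocking
	// every other query behind a ~200ms API call.
	v, err := c.embed(key)
	if err != nil {
		return nil, err
	}

	c.mu.Lock()
	c.store[key] = v
	c.mu.Unlock()
	return v, nil
}

func main() {
	calls := 0
	c := &embedCache{
		store: map[string][]float32{},
		embed: func(q string) ([]float32, error) { calls++; return []float32{float32(len(q))}, nil },
	}
	c.Get("how to reverse a payment")
	c.Get("How to reverse a payment ") // normalizes to the same key
	fmt.Println(calls)
}
```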

Don’t over-index on the vector database choice. I spent three days evaluating Pinecone vs. Weaviate vs. pgvector. Should have spent three hours. At our scale, they all work. Pick the one that fits your existing stack and move on.

What matters

Semantic search is production-ready infrastructure now. The hard parts aren’t the vector math – they’re the same boring engineering problems as always: data quality (chunking), evaluation (do you actually measure relevance?), and hybrid approaches (because no single signal is enough).

Build the eval set first. Chunk on document structure. Use hybrid retrieval. Everything else is tuning.