Quick take
Vector databases store numeric representations of data and find the closest matches. That’s it. The real decisions are about indexing strategy, hybrid search, and whether you need a dedicated system or can extend what you already have. Most teams overthink the database choice and underthink the data pipeline.
Everyone is talking about vector databases right now, and about half of what I hear is wrong. So let me cut through the noise.
What Is Actually Stored
An embedding model turns text, images, or structured data into vectors – arrays of floating-point numbers that encode meaning. “Cat” and “kitten” end up close together in vector space. “Cat” and “quarterly earnings report” don’t.
A vector database stores these vectors and lets you find the closest ones to a query. That’s the core primitive: similarity search. Traditional databases give you exact matches, range queries, and transactions. Vector databases give you “find me the things most similar to this.”
```go
type Document struct {
	ID        string            `json:"id"`
	Content   string            `json:"content"`
	Embedding []float64         `json:"embedding"`
	Metadata  map[string]string `json:"metadata"`
}

type SearchResult struct {
	Document Document
	Score    float64
}
```
That’s the data model. A document, its vector, and some metadata. Everything else is indexing and retrieval strategy.
Similarity Metrics
“Closest” needs a definition. The common choices:
Cosine similarity measures the angle between vectors, ignoring magnitude. Good when you care about direction (meaning) and not length. Most text embedding models are trained with this in mind.
Dot product considers both direction and magnitude. Useful when the embedding model encodes importance in vector length.
Euclidean (L2) distance measures straight-line distance between vectors. For normalized vectors it produces the same ranking as cosine similarity, which is why you’ll see it as an option in index configs.
Pick the metric that matches your embedding model’s training objective. The wrong choice can look fine on a small test set and produce garbage results at scale. I’ve seen this firsthand – a team using dot product with a model trained for cosine, wondering why their search quality was inconsistent.
```go
import "math"

// cosineSimilarity returns the cosine of the angle between a and b:
// 1 for identical directions, 0 for orthogonal, -1 for opposite.
func cosineSimilarity(a, b []float64) float64 {
	var dot, normA, normB float64
	for i := range a {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	if normA == 0 || normB == 0 {
		return 0 // a zero vector is similar to nothing
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}
```
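Here’s the mismatch in miniature, with a compact dot/cosine pair restated so the snippet stands alone: a short vector that points the same way as the query wins under cosine, while a long, less-aligned vector wins under raw dot product.

```go
package main

import (
	"fmt"
	"math"
)

func dot(a, b []float64) float64 {
	var d float64
	for i := range a {
		d += a[i] * b[i]
	}
	return d
}

func cosine(a, b []float64) float64 {
	na, nb := math.Sqrt(dot(a, a)), math.Sqrt(dot(b, b))
	if na == 0 || nb == 0 {
		return 0
	}
	return dot(a, b) / (na * nb)
}

func main() {
	query := []float64{1, 0}
	aligned := []float64{0.9, 0.1} // points almost the same way, small magnitude
	big := []float64{5, 3}         // less aligned, much larger magnitude

	fmt.Println(cosine(query, aligned) > cosine(query, big)) // true: angle wins
	fmt.Println(dot(query, aligned) > dot(query, big))       // false: magnitude wins
}
```

Two metrics, two different winners for the same query. That’s the inconsistency the team above was chasing.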
The Deployment Decision
There’s no universal answer. It depends on your constraints:
Extend your existing database. PostgreSQL with pgvector, for example. Good when vectors are one part of a larger transactional system. You get joins, transactions, and familiar operations alongside similarity search. The tradeoff is that you’re bolting a new capability onto infrastructure not designed for it. Fine for moderate scale. Starts to hurt when similarity search becomes the dominant workload.
Use a dedicated vector database. Weaviate, Qdrant, Milvus. Good when similarity search is central and you need low latency at scale. These systems are built for this workload. The tradeoff is another piece of infrastructure to operate, another data pipeline to maintain.
Go managed. Pinecone, or a hosted tier of the systems above. Good when operational overhead is a bigger risk than vendor lock-in. You pay more per query, but you don’t get paged at 3am because an index rebuild failed.
For most teams I talk to, the honest answer is: start with pgvector. It’s good enough for most workloads, you already know how to operate Postgres, and you avoid adding a new system to your stack until you actually need to. Move to a dedicated solution when latency or scale forces your hand.
Indexing: Exact vs. Approximate
Exact search compares the query against every vector. Accurate, simple, and slow as your data grows. At 10,000 vectors it’s fine. At 10 million it isn’t.
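A flat scan is short enough to write out. A sketch, assuming embeddings are L2-normalized so cosine similarity reduces to a plain dot product; it returns indices into the store rather than full documents to stay self-contained:

```go
package main

import (
	"fmt"
	"sort"
)

// searchFlat scores every stored vector against the query and keeps the
// top k. O(n·d) per query: fine at thousands of vectors, painful at
// millions. Vectors are assumed L2-normalized, so the dot product here
// is cosine similarity.
func searchFlat(embeddings [][]float64, query []float64, k int) []int {
	type scored struct {
		idx   int
		score float64
	}
	results := make([]scored, len(embeddings))
	for i, e := range embeddings {
		var s float64
		for j := range e {
			s += e[j] * query[j]
		}
		results[i] = scored{i, s}
	}
	sort.Slice(results, func(a, b int) bool { return results[a].score > results[b].score })
	if k > len(results) {
		k = len(results)
	}
	out := make([]int, k)
	for i := range out {
		out[i] = results[i].idx
	}
	return out
}

func main() {
	vecs := [][]float64{{1, 0}, {0, 1}, {0.8, 0.6}}
	fmt.Println(searchFlat(vecs, []float64{1, 0}, 2)) // [0 2]
}
```

Every production system starts here conceptually; ANN indexes exist to avoid the full scan.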
Approximate nearest neighbor (ANN) indexes trade a small amount of accuracy for much better performance. HNSW (Hierarchical Navigable Small World) is the most common algorithm – it builds a layered, navigable graph that lets you skip most comparisons while still finding results close to the true nearest neighbors.
```go
type IndexConfig struct {
	Type       string // "flat", "hnsw", "ivf"
	Dimensions int
	Metric     string // "cosine", "dot", "l2"

	// HNSW-specific
	EfConstruction int // build-time quality (higher = slower build, better recall)
	EfSearch       int // query-time quality (higher = slower query, better recall)
	M              int // connections per node (higher = more memory, better recall)
}
```
The parameters matter. EfConstruction and M control the build-time quality-speed tradeoff. EfSearch controls the same tradeoff at query time. Treat these as part of your performance budget. Tune them with your actual data and actual queries, not defaults from a blog post.
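As a hypothetical starting point – values to tune, not recommendations – a config for the IndexConfig struct above might look like:

```go
// Starting values only; measure recall and latency on your own data
// before settling on any of these.
cfg := IndexConfig{
	Type:           "hnsw",
	Dimensions:     768, // must match your embedding model's output size
	Metric:         "cosine",
	EfConstruction: 200,
	EfSearch:       100,
	M:              16,
}
```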
Chunking Strategy
Most text is too large to embed as a single vector. You need to chunk it into pieces that are small enough to be meaningful but large enough to carry context.
```go
type Chunk struct {
	Text  string
	Index int
}

type ChunkConfig struct {
	MaxTokens int    // upper bound on tokens per chunk
	Overlap   int    // sentences carried into the next chunk
	Separator string // optional boundary override (unused in this sketch)
}

// chunkDocument splits content into sentence-aligned chunks of at most
// cfg.MaxTokens tokens, carrying cfg.Overlap trailing sentences across
// each boundary. splitSentences, tokenCount, and tokenCountSlice are
// assumed helpers: a sentence splitter and your embedding model's
// tokenizer.
func chunkDocument(content string, cfg ChunkConfig) []Chunk {
	sentences := splitSentences(content)
	var chunks []Chunk
	var current []string
	currentLen := 0
	for _, s := range sentences {
		sLen := tokenCount(s)
		if currentLen+sLen > cfg.MaxTokens && len(current) > 0 {
			chunks = append(chunks, Chunk{
				Text:  strings.Join(current, " "),
				Index: len(chunks),
			})
			// Keep the last cfg.Overlap sentences so context that
			// spans a chunk boundary isn't lost.
			overlapStart := len(current) - cfg.Overlap
			if overlapStart < 0 {
				overlapStart = 0
			}
			current = current[overlapStart:]
			currentLen = tokenCountSlice(current)
		}
		current = append(current, s)
		currentLen += sLen
	}
	if len(current) > 0 {
		chunks = append(chunks, Chunk{
			Text:  strings.Join(current, " "),
			Index: len(chunks),
		})
	}
	return chunks
}
```
Key decisions: chunk size (I default to 256-512 tokens), overlap (10-20% prevents losing context at boundaries), and whether to chunk by sentences, paragraphs, or fixed token counts. Sentence-based chunking preserves meaning better. Fixed-token chunking is simpler.
Store metadata with every chunk: source document ID, position index, creation timestamp, access control labels. You’ll need all of these for filtering, ordering, and debugging.
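A pre-filter over that metadata can be as simple as an exact-match check. A sketch – `matchesAll` is a name invented here, not a library function:

```go
package main

import "fmt"

// matchesAll reports whether metadata contains every required
// key/value pair. Run it before (or after) the similarity search to
// enforce things like access-control labels or source restrictions.
func matchesAll(metadata, required map[string]string) bool {
	for k, v := range required {
		if metadata[k] != v {
			return false
		}
	}
	return true
}

func main() {
	meta := map[string]string{"source": "handbook", "acl": "internal"}
	fmt.Println(matchesAll(meta, map[string]string{"acl": "internal"})) // true
	fmt.Println(matchesAll(meta, map[string]string{"acl": "public"}))   // false
}
```

Whether to filter before or after the vector search is its own tradeoff: pre-filtering shrinks the candidate set but many ANN indexes handle it poorly.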
Hybrid Search
Vector search is great at semantic matching. It’s terrible at exact matches. If a user searches for “RFC 7231” you want the exact document, not the three most semantically similar documents about HTTP specifications.
Combine vector search with keyword search. The simplest approach: run both, merge results, and let a weighted score decide the final ranking. More sophisticated approaches use reciprocal rank fusion or learn the weights from user behavior.
Don’t overthink the blending logic at the start. A simple weighted combination that you can adjust is better than a clever algorithm you can’t debug.
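A minimal sketch of that weighted blend, assuming each list’s scores are min-max-normalized first so vector and keyword scales are comparable (the function names here are illustrative):

```go
package main

import (
	"fmt"
	"math"
)

// blendScores combines a vector result list and a keyword result list
// into one ranking. Each list is normalized to [0, 1] first;
// vectorWeight (e.g. 0.7) decides how much the semantic side counts.
func blendScores(vector, keyword map[string]float64, vectorWeight float64) map[string]float64 {
	blended := make(map[string]float64)
	for id, s := range normalize(vector) {
		blended[id] += vectorWeight * s
	}
	for id, s := range normalize(keyword) {
		blended[id] += (1 - vectorWeight) * s
	}
	return blended
}

// normalize min-max-scales scores to [0, 1].
func normalize(scores map[string]float64) map[string]float64 {
	lo, hi := math.Inf(1), math.Inf(-1)
	for _, s := range scores {
		lo, hi = math.Min(lo, s), math.Max(hi, s)
	}
	out := make(map[string]float64, len(scores))
	for id, s := range scores {
		if hi == lo {
			out[id] = 1 // one result (or all equal): treat as a full match
		} else {
			out[id] = (s - lo) / (hi - lo)
		}
	}
	return out
}

func main() {
	blended := blendScores(
		map[string]float64{"doc1": 0.91, "doc2": 0.55}, // cosine scores
		map[string]float64{"doc2": 7.2, "doc3": 3.1},   // keyword (e.g. BM25) scores
		0.7,
	)
	fmt.Println(blended["doc1"] > blended["doc2"]) // true: semantic winner leads
}
```

The single `vectorWeight` knob is the point: you can explain it, log it, and tune it from your eval set.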
Reindexing
Embeddings go stale. Your data changes. Your embedding model gets updated. You need a reindexing pipeline that’s repeatable and safe.
Build it as a batch job that can run alongside the live index. Write to a new index, validate it against your eval set, then swap. Blue-green deployments for vectors. Nothing fancy, but it has to work reliably because a corrupted index is a corrupted search experience.
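A sketch of the swap itself, using an atomic pointer so query traffic never sees a half-built index (the Index type and eval hook are placeholders):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Index stands in for a fully built vector index; only the plumbing of
// the blue-green swap is sketched here.
type Index struct {
	Name string
	// ... vectors, graph, metadata ...
}

// active is what query traffic reads through. atomic.Pointer makes the
// swap safe without locking the read path.
var active atomic.Pointer[Index]

// swapIfValid promotes candidate only if it passes the eval gate;
// otherwise the old index keeps serving.
func swapIfValid(candidate *Index, passesEval func(*Index) bool) bool {
	if !passesEval(candidate) {
		return false
	}
	active.Store(candidate)
	return true
}

func main() {
	active.Store(&Index{Name: "blue"})
	swapIfValid(&Index{Name: "green"}, func(*Index) bool { return true })
	fmt.Println(active.Load().Name) // green
}
```

The eval gate is what makes this safe: a candidate index that fails validation simply never becomes active.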
The Eval Loop
Before you ship, build a small eval set: 50-100 representative queries with known good results. Measure recall and latency. Then test again at production data volume. A vector search that works on 1,000 documents and breaks on 100,000 is a demo, not a product.
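Recall@k is the workhorse metric here – the fraction of known-relevant results that show up in the top k:

```go
package main

import "fmt"

// recallAtK returns the fraction of relevant IDs that appear among the
// first k retrieved IDs. Average it across the eval set's queries.
func recallAtK(retrieved, relevant []string, k int) float64 {
	if len(relevant) == 0 {
		return 0
	}
	if k > len(retrieved) {
		k = len(retrieved)
	}
	top := make(map[string]bool, k)
	for _, id := range retrieved[:k] {
		top[id] = true
	}
	hits := 0
	for _, id := range relevant {
		if top[id] {
			hits++
		}
	}
	return float64(hits) / float64(len(relevant))
}

func main() {
	fmt.Println(recallAtK([]string{"a", "b", "c", "d"}, []string{"a", "d"}, 3)) // 0.5
}
```

Run it once against the exact (flat) results to measure what your ANN index is giving up, and again whenever you touch the index parameters.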
Track these numbers over time. Embedding model updates, data changes, and index configuration tweaks all affect quality. If you aren’t measuring, you’re guessing.
Bottom Line
Vector databases are infrastructure. Not magic, not a product differentiator – infrastructure. Choose based on your actual constraints, not the hype cycle. Start simple, measure everything, and upgrade when the data tells you to.