Quick take
Bigger embeddings aren’t always better. ada-002 is the best default for most teams. Self-hosted models are worth it only if you have data residency requirements or very high volume. Measure on your own data – public benchmarks are misleading for domain-specific content.
After building the semantic search system I wrote about two weeks ago, I wanted to understand how much the embedding model actually matters. Everyone has opinions. I wanted numbers.
I took our 15,000-chunk documentation corpus and the 150-query eval set and ran the same retrieval pipeline with five different embedding models. Same chunking, same hybrid retrieval logic, same ranking weights. Only the embeddings changed.
The models tested
| Model | Dimensions | Source | Cost per 1M tokens |
|---|---|---|---|
| OpenAI ada-002 | 1536 | API | $0.10 |
| Cohere embed-english-v3.0 | 1024 | API | $0.10 |
| sentence-transformers/all-MiniLM-L6-v2 | 384 | Self-hosted | Compute only |
| BAAI/bge-base-en-v1.5 | 768 | Self-hosted | Compute only |
| instructor-large | 768 | Self-hosted | Compute only |
I ran the self-hosted models on a single GPU instance (A10G). Not a fair cost comparison since you’d amortize that across many workloads, but it gives a ballpark.
The benchmark setup
The eval set has 150 queries. Each query has 1-5 relevant document IDs (manually verified). I measured:
- Precision@5: Of the top 5 results, how many are relevant?
- MRR (Mean Reciprocal Rank): How high is the first relevant result?
- Recall@10: Of all relevant docs, how many appear in the top 10?
The retrieval pipeline uses hybrid search (vector + keyword), so the embedding quality is only one factor. That’s intentional – I wanted to measure model impact in a realistic setup, not in isolation.
type BenchResult struct {
	Model       string
	Precision5  float64
	MRR         float64
	Recall10    float64
	AvgLatency  time.Duration // embedding generation time
	IndexSizeMB float64
}
func benchmark(ctx context.Context, model EmbeddingModel, evalSet []EvalCase) BenchResult {
	var result BenchResult
	result.Model = model.Name()
	for _, c := range evalSet {
		start := time.Now()
		queryVec, err := model.Embed(ctx, c.Query)
		if err != nil {
			// Fail fast: a broken model invalidates the whole run.
			log.Fatalf("embed %q: %v", c.Query, err)
		}
		result.AvgLatency += time.Since(start)

		results := hybridSearch(ctx, queryVec, c.Query, 10)
		topFive := results[:min(5, len(results))]
		result.Precision5 += precision(extractIDs(topFive), c.RelevantIDs)
		result.MRR += reciprocalRank(extractIDs(results), c.RelevantIDs)
		result.Recall10 += recall(extractIDs(results), c.RelevantIDs)
	}
	// Average each metric over the eval set.
	n := float64(len(evalSet))
	result.Precision5 /= n
	result.MRR /= n
	result.Recall10 /= n
	result.AvgLatency /= time.Duration(len(evalSet))
	return result
}
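For completeness, here's roughly what the precision, reciprocalRank, and recall helpers look like. A minimal sketch: the []string document-ID representation and the contains helper are my assumptions, not code from the production system.

```go
package main

// contains reports whether id is in the given ID list.
func contains(ids []string, id string) bool {
	for _, x := range ids {
		if x == id {
			return true
		}
	}
	return false
}

// precision returns the fraction of retrieved IDs that are relevant.
func precision(retrieved, relevant []string) float64 {
	if len(retrieved) == 0 {
		return 0
	}
	hits := 0
	for _, id := range retrieved {
		if contains(relevant, id) {
			hits++
		}
	}
	return float64(hits) / float64(len(retrieved))
}

// reciprocalRank returns 1/rank of the first relevant result, or 0
// if no relevant document was retrieved.
func reciprocalRank(retrieved, relevant []string) float64 {
	for i, id := range retrieved {
		if contains(relevant, id) {
			return 1.0 / float64(i+1)
		}
	}
	return 0
}

// recall returns the fraction of relevant IDs that were retrieved.
func recall(retrieved, relevant []string) float64 {
	if len(relevant) == 0 {
		return 0
	}
	hits := 0
	for _, id := range relevant {
		if contains(retrieved, id) {
			hits++
		}
	}
	return float64(hits) / float64(len(relevant))
}
```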
The results
| Model | Precision@5 | MRR | Recall@10 | Embed latency (avg) | Index size |
|---|---|---|---|---|---|
| ada-002 | 0.89 | 0.92 | 0.94 | 45ms | 92MB |
| Cohere v3 | 0.87 | 0.91 | 0.93 | 52ms | 61MB |
| bge-base-en-v1.5 | 0.85 | 0.89 | 0.91 | 8ms* | 46MB |
| instructor-large | 0.86 | 0.90 | 0.92 | 12ms* | 46MB |
| all-MiniLM-L6-v2 | 0.78 | 0.83 | 0.85 | 3ms* | 23MB |
*Self-hosted latency measured on A10G GPU. Your mileage will vary.
What surprised me
The gap is smaller than I expected. ada-002 wins, but bge-base is only 4 points behind on precision, and the model itself costs nothing to run beyond GPU compute. For teams with data residency requirements – and I’ve worked with several in telecom where data can’t leave certain jurisdictions – bge-base is a completely viable option.
MiniLM isn’t good enough. I see this model recommended everywhere as “good enough for most use cases.” For our content, it wasn’t. The 384-dimension space loses too much nuance for technical documentation. Queries with subtle distinctions (“payment reversal” vs “payment cancellation”) returned near-identical results. The other models correctly differentiated these.
Cohere v3 is underrated. Smaller vectors (1024 vs 1536) mean smaller indexes and faster distance calculations. Quality is barely behind ada-002. If index size or search latency matters at scale, this is worth considering.
Instructor-large with task-specific prefixes helped. This model lets you prefix the text with an instruction like “Represent this document for retrieval.” That prefix improved precision by about 2 points over embedding without it. Small but measurable. The downside is you need to be consistent about prefixes across ingestion and query time.
How to actually choose
Here’s my decision framework after running these numbers:
Use ada-002 if: You’re calling an API anyway, you want the best quality with no operational overhead, and your data can leave your infrastructure. This is the default choice for most teams and I don’t see a reason to fight it.
Use bge-base-en-v1.5 if: You need to self-host for data residency, cost at volume, or latency. It’s the best quality-to-size ratio in the open-source space right now. Run it on a GPU instance and batch your embedding calls.
Use Cohere v3 if: You want API convenience with smaller vectors. Good option if you’re building for scale and index size is a concern.
Skip MiniLM unless: You’re doing clustering or deduplication where near-miss quality is acceptable, or you’re running on CPU-only infrastructure and need the smallest possible model. For search and retrieval, it’s not good enough.
The embedding pipeline in practice
Whatever model you pick, the pipeline structure stays the same. Here’s the interface I use to swap models without changing the rest of the system:
type EmbeddingModel interface {
	Name() string
	Dimensions() int
	Embed(ctx context.Context, text string) ([]float32, error)
	EmbedBatch(ctx context.Context, texts []string) ([][]float32, error)
}
type OpenAIEmbedder struct {
	client *openai.Client
	model  string
}

func (e *OpenAIEmbedder) EmbedBatch(ctx context.Context, texts []string) ([][]float32, error) {
	resp, err := e.client.CreateEmbeddings(ctx, openai.EmbeddingRequest{
		Model: openai.EmbeddingModel(e.model),
		Input: texts,
	})
	if err != nil {
		return nil, fmt.Errorf("openai embed: %w", err)
	}
	vectors := make([][]float32, len(resp.Data))
	for i, d := range resp.Data {
		vectors[i] = d.Embedding
	}
	return vectors, nil
}
For the self-hosted models, I run a small Go service that wraps the Python model behind a gRPC endpoint. Not glamorous but it keeps the main application in Go and isolates the Python dependency. The latency overhead from the extra hop is about 2ms, which is noise compared to the embedding computation itself.
Operational things I’ve learned
Version your embeddings. When you switch models, you need to re-embed everything. That means your index should track which model version generated each vector. I store a model_version column and re-index in the background, switching over atomically once the new index is complete.
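The atomic switch-over can be as simple as swapping a pointer once the background re-index completes. A sketch – VectorIndex is a stand-in for whatever your real index type is:

```go
package main

import "sync/atomic"

// VectorIndex is a stand-in for the real index type (hypothetical).
// The important part is that it records which model produced its vectors.
type VectorIndex struct {
	ModelVersion string
	// ... vectors, ID mapping, etc.
}

// liveIndex is the index queries read from. It is swapped atomically
// once background re-embedding finishes, so readers never see a
// partially re-embedded index.
var liveIndex atomic.Pointer[VectorIndex]

// switchOver publishes a fully built index. In-flight queries keep
// the old pointer they already loaded.
func switchOver(newIdx *VectorIndex) {
	liveIndex.Store(newIdx)
}

// currentIndex returns the index queries should use right now.
func currentIndex() *VectorIndex {
	return liveIndex.Load()
}
```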
Batch aggressively during ingestion. The ada-002 API accepts up to 2048 inputs per request, so sending chunks one at a time burns money on request overhead. I batch in groups of 100 during ingestion and send single requests at query time, where latency matters more.
Cache query embeddings. We get about 30% repeat queries. A simple Redis cache on the query embedding saves both latency and cost. Key on the query text, expire after 24 hours. This was a 5-line change that cut our embedding API costs by a quarter.
func (e *CachedEmbedder) Embed(ctx context.Context, text string) ([]float32, error) {
	key := "emb:" + hash(text)
	if cached, err := e.cache.Get(ctx, key); err == nil {
		return deserializeVector(cached), nil
	}
	vec, err := e.inner.Embed(ctx, text)
	if err != nil {
		return nil, err
	}
	// Best-effort cache write: a failed Set shouldn't fail the request.
	_ = e.cache.Set(ctx, key, serializeVector(vec), 24*time.Hour)
	return vec, nil
}
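The hash, serializeVector, and deserializeVector helpers aren't shown above. One possible implementation uses SHA-256 for the cache key and little-endian float32 packing for the vector bytes – that encoding choice is mine, and you'd want to think harder about it if other languages need to read the same cache:

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"encoding/hex"
	"math"
)

// hash returns a stable, fixed-length cache key component for the query text.
func hash(text string) string {
	sum := sha256.Sum256([]byte(text))
	return hex.EncodeToString(sum[:])
}

// serializeVector packs float32s as little-endian bytes, 4 bytes per value.
func serializeVector(vec []float32) []byte {
	buf := make([]byte, 4*len(vec))
	for i, f := range vec {
		binary.LittleEndian.PutUint32(buf[i*4:], math.Float32bits(f))
	}
	return buf
}

// deserializeVector is the exact inverse of serializeVector.
func deserializeVector(data []byte) []float32 {
	vec := make([]float32, len(data)/4)
	for i := range vec {
		vec[i] = math.Float32frombits(binary.LittleEndian.Uint32(data[i*4:]))
	}
	return vec
}
```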
The actual takeaway
The model matters less than you think. The difference between the best and worst model I tested (excluding MiniLM) was 4 points of precision. The difference between good and bad chunking – which I tested in the previous post – was 17 points. The difference between pure vector search and hybrid retrieval was also 17 points.
Spend your time on chunking and retrieval strategy. Pick an embedding model that fits your operational constraints and move on. The benchmarks on the MTEB leaderboard are interesting. Your eval set on your data is what actually matters.