Quick take
Your AI cost isn’t what the pricing page says. It’s tokens times retries times fallbacks times human review – all shaped by your specific prompts and workload. Benchmark against your actual tasks or you’re optimizing fiction.
Every few weeks someone sends me a spreadsheet comparing AI provider pricing and asks “which one should we use?” The spreadsheet always compares cost per million tokens. It’s always useless.
After working on AI cost optimization since early 2024, I can tell you the gap between headline pricing and actual production cost is consistently 3-10x. Providers with the cheapest tokens sometimes end up being the most expensive per completed task. Here’s why and how to benchmark properly.
The Real Cost Stack
Token price is one line item. Production cost includes everything the system does to deliver a reliable result.
| Cost Layer | What It Includes | Typical Share |
|---|---|---|
| Model inference | Input + output tokens | 30-50% |
| Retries & fallbacks | Failed attempts, quality retries, provider failover | 10-25% |
| Retrieval & preprocessing | Embedding, search, context assembly | 10-20% |
| Human review | Escalation, QA sampling, edge case handling | 10-30% |
| Infrastructure | Caching, logging, orchestration | 5-10% |
Teams that track only model inference are missing half their spend. I learned this the hard way on a document processing pipeline: the retry rate on complex documents was 40%, effectively doubling model cost. The pricing spreadsheet didn’t mention that.
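A rough way to see the whole stack in one number is a cost-per-completed-task estimate. The sketch below is illustrative: the function and every rate in it are assumptions, not measurements from a real pipeline.

```python
# Sketch of a cost-per-completed-task estimate. All rates below are
# illustrative placeholders, not data from any real workload.

def cost_per_success(base_cost: float, retry_rate: float,
                     review_rate: float, review_cost: float,
                     overhead: float = 0.10) -> float:
    """Effective cost of one *successful* task.

    base_cost   -- model inference cost for a single attempt
    retry_rate  -- fraction of attempts that fail and are retried
    review_rate -- fraction of results escalated to a human
    review_cost -- cost of one human review
    overhead    -- infra share (caching, logging, orchestration)
    """
    # Retry-until-success with a constant failure rate is a geometric
    # series: expected attempts = 1 / (1 - retry_rate).
    inference = base_cost / (1.0 - retry_rate)
    review = review_rate * review_cost
    return (inference + review) * (1.0 + overhead)

# A $0.01 call with a 40% retry rate and occasional human review:
naive = 0.01
actual = cost_per_success(0.01, retry_rate=0.40,
                          review_rate=0.05, review_cost=0.50)
print(f"naive ${naive:.4f} vs actual ${actual:.4f}")
```

The pricing-page number is the `base_cost` argument; everything else is what your spreadsheet is missing.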
Benchmark Your Tasks, Not Generic Prompts
A useful benchmark mirrors your actual workload. Generic “summarize this article” tests tell you nothing about how a model performs on your prompts, what its failure rate looks like on your inputs, or whether it meets your latency requirements.
Build a benchmark set that covers:
| Task Category | Why It Matters | What to Measure |
|---|---|---|
| High-volume simple tasks | Dominates token count | Cost per success, latency p50 |
| Complex multi-step tasks | Dominates per-task spend | Total cost including retries, success rate |
| Edge cases / policy triggers | Drives fallback and review cost | Escalation rate, human time per case |
| Retrieval-heavy tasks | Preprocessing is a big chunk of cost | End-to-end cost, retrieval overhead ratio |
Keep this set stable. If benchmark inputs change every week, you can’t tell whether cost shifts came from system changes or test changes.
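One way to wire this up, sketched in Python. `run_task` is a hypothetical stand-in for your real pipeline call, stubbed here so the harness runs end to end; the metric names match the table above.

```python
# Minimal benchmark-harness sketch. `run_task` is a stub standing in
# for a real pipeline call; replace it with your own.
import statistics

def run_task(task):
    # Stub: returns (success, cost_usd, latency_seconds).
    return True, task.get("est_cost", 0.01), 0.5

def benchmark(tasks):
    results = [run_task(t) for t in tasks]
    successes = [r for r in results if r[0]]
    total_cost = sum(r[1] for r in results)   # failed attempts still cost money
    return {
        "success_rate": len(successes) / len(results),
        "cost_per_success": total_cost / max(len(successes), 1),
        "latency_p50": statistics.median(r[2] for r in results),
    }

# A stable, versioned benchmark set: same inputs every run.
fixed_set = [{"id": i, "est_cost": 0.01} for i in range(20)]
print(benchmark(fixed_set))
```

The key design choice is dividing total cost (including failures) by successes only, which is what makes "cost per success" honest.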
Compare Approaches, Not Providers
Provider names and model versions change quarterly. A benchmark built around “GPT-4 vs Claude 3.5” has a shelf life of weeks. Instead, compare the architectural choices you control:
| Approach | Cost Profile | When It Wins |
|---|---|---|
| Large model, single pass | High per-call, low retry | Simple tasks, tight latency budgets |
| Small model + reranker | Lower per-call, extra step | High volume, tolerance for pipeline complexity |
| Router: small for easy, large for hard | Variable, needs routing logic | Mixed workloads with clear difficulty signals |
| Self-hosted open model | Fixed infra cost, zero per-token | High volume, data residency, offline needs |
The router pattern is where I’ve seen the biggest wins. One team cut monthly spend by 60% by routing straightforward classification tasks to a small model and reserving the large model for generation. Classification accuracy from the small model was identical. They were paying frontier prices for commodity work.
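A minimal sketch of the router pattern, assuming made-up per-token prices and a deliberately crude difficulty signal (task type plus input length). Real routers need better signals, but the shape is the same.

```python
# Router-pattern sketch: cheap model for easy tasks, frontier model for
# hard ones. Prices and the routing rule are illustrative placeholders.

PRICES = {"small": 0.0002, "large": 0.01}   # $ per 1K tokens, made up

def route(task_type: str, input_tokens: int) -> str:
    # Clear, cheap-to-compute signals beat a learned router to start with.
    if task_type == "classification" or input_tokens < 500:
        return "small"
    return "large"

def projected_cost(tasks):
    return sum(PRICES[route(t["type"], t["tokens"])] * t["tokens"] / 1000
               for t in tasks)

# Mixed workload: mostly classification, some long-form generation.
workload = ([{"type": "classification", "tokens": 300}] * 80 +
            [{"type": "generation", "tokens": 2000}] * 20)
all_large = sum(PRICES["large"] * t["tokens"] / 1000 for t in workload)
print(f"all-large ${all_large:.2f} vs routed ${projected_cost(workload):.2f}")
```

Run this against your own task mix before committing: the win depends entirely on how much of your volume is genuinely easy.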
The Drivers That Actually Move Your Bill
Forget micro-optimizing prompts. These four factors determine 80% of your cost trajectory:
Response length drift. Prompts evolve over time. Engineers add instructions, examples, formatting requirements. Output gets longer. Nobody notices until the bill does. Track average output tokens per task type weekly.
Retry rates. Every retry is a full cost event. If your validation rejects 20% of responses and retries until it gets a pass, expected attempts form a geometric series and your effective cost is 1/(1 − 0.2) = 1.25x the base. Push the rejection rate to 40% and you’re at 1.67x. Measure retry rate by task type and fix the root cause.
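The retry arithmetic generalizes. Assuming each attempt fails independently at the same rate, the cost multiplier is a geometric series; the small helper below (hypothetical, not from any library) makes the capped and uncapped cases concrete.

```python
# Effective-cost multiplier from retries, assuming each attempt fails
# independently with the same rate.
from typing import Optional

def retry_multiplier(failure_rate: float,
                     max_retries: Optional[int] = None) -> float:
    if max_retries is None:                  # retry until success
        return 1.0 / (1.0 - failure_rate)    # geometric series limit
    # Capped retries: 1 + p + p^2 + ... + p^max_retries
    return sum(failure_rate ** k for k in range(max_retries + 1))

print(retry_multiplier(0.20))       # retry forever at 20% failure: 1.25x
print(retry_multiplier(0.20, 2))    # capped at two retries: ~1.24x
```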
Retrieval bloat. Context windows keep growing, so teams stuff more chunks in. More context means more input tokens. But past a point, more context doesn’t improve answers – it just costs more. Measure answer quality versus context size and find the plateau.
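One way to find that plateau, sketched with made-up eval scores. `quality_by_k` would come from your own eval harness, mapping chunk count to a mean quality score.

```python
# Plateau-finding sketch: stop adding context chunks once the quality
# gain per doubling drops below a threshold. Scores are illustrative.

def find_plateau(quality_by_k: dict, min_gain: float = 0.01) -> int:
    """quality_by_k maps chunk count -> mean eval score (0-1)."""
    ks = sorted(quality_by_k)
    best = ks[0]
    for prev, k in zip(ks, ks[1:]):
        if quality_by_k[k] - quality_by_k[prev] < min_gain:
            break   # marginal gain too small: stop paying for more context
        best = k
    return best

# Hypothetical eval results: quality flattens after ~8 chunks.
scores = {2: 0.70, 4: 0.78, 8: 0.82, 16: 0.825, 32: 0.82}
print(find_plateau(scores))   # -> 8
```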
Routing waste. Sending everything to the most capable model is the default because it’s easy. It’s also the most expensive default. Any task where a smaller model achieves the same success rate is money burned on the large model.
Self-Hosting: When the Math Works
Self-hosting isn’t a cost optimization for most teams. It works for teams with specific constraints:
- Predictable, high-volume workloads where the per-token savings exceed infra costs
- Strict data residency or air-gapped environments
- Fine-tuned models that don’t exist as hosted APIs
For bursty workloads or teams that need frequent model upgrades, operational overhead eats the savings. I’ve talked a few teams out of self-hosting after we modeled actual GPU costs, ops burden, and iteration-speed penalties. The math didn’t work for them. It might for you. Run the numbers on your workload, not someone else’s blog post.
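A back-of-the-envelope break-even sketch for that modeling exercise. Every dollar figure below is a placeholder to swap for your own GPU quotes, ops estimates, and API pricing.

```python
# Break-even volume for self-hosting vs a hosted API.
# All numbers are placeholders; plug in your own quotes and workload.

def breakeven_tokens_per_month(gpu_monthly: float, ops_monthly: float,
                               api_price_per_m: float,
                               self_host_price_per_m: float) -> float:
    """Monthly token volume above which self-hosting is cheaper."""
    fixed = gpu_monthly + ops_monthly
    saving_per_m = api_price_per_m - self_host_price_per_m
    return fixed / saving_per_m * 1_000_000

# e.g. $6k/month of GPUs, $4k/month of ops time, $5 vs $0.50 per M tokens
tokens = breakeven_tokens_per_month(6000, 4000, 5.00, 0.50)
print(f"break-even ≈ {tokens / 1e9:.1f}B tokens/month")
```

If your workload is an order of magnitude below the break-even volume, the conversation is over before it starts.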
Set Up Monitoring Before You Need It
A benchmark is a snapshot. Production spend is a moving target. Set up cost monitoring from day one:
- Track cost per successful task, not cost per API call
- Break it down by feature and user tier
- Alert on spend spikes and retry rate increases
- Review monthly with someone who owns the budget
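The spend-spike and retry-rate alerts above can be as simple as a trailing-baseline check. The thresholds in this sketch (1.5x spend, +5 points of retry rate) are illustrative defaults, not recommendations.

```python
# Trailing-baseline alert sketch for daily cost metrics.
# Thresholds are illustrative; tune them to your workload's variance.
import statistics

def check_daily(metrics_history, spend_factor=1.5, retry_delta=0.05):
    """metrics_history: oldest-first list of {'spend': $, 'retry_rate': 0-1}."""
    baseline = metrics_history[:-1]
    today = metrics_history[-1]
    alerts = []
    if today["spend"] > spend_factor * statistics.mean(
            m["spend"] for m in baseline):
        alerts.append("spend spike")
    if today["retry_rate"] > retry_delta + statistics.mean(
            m["retry_rate"] for m in baseline):
        alerts.append("retry rate increase")
    return alerts

history = ([{"spend": 100, "retry_rate": 0.10}] * 7 +
           [{"spend": 180, "retry_rate": 0.18}])
print(check_daily(history))   # both alerts fire on the last day
```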
The teams that catch cost problems early treat them like performance regressions. The teams that catch them late treat them like budget emergencies.
Boring systems, predictable bills.