## Quick take
Benchmarks mislead. The right model depends on your tasks, latency requirements, cost tolerance, and how much ops overhead you can absorb. Run a bake-off on your actual workload. Route between models. Stop looking for a universal winner.
I get asked “which model should we use?” at least once a week. The answer is always the same: it depends. That answer always disappoints, so let me make it useful.
The late-2024 model landscape is competitive enough that the gap between top-tier providers on general tasks is small. The differences that matter most are operational, not intellectual. Here’s how I think about model selection for production systems.
## The Landscape at a Glance
Two tracks dominate. Hosted APIs from Anthropic, OpenAI, and Google iterate fast and are easiest to ship with. Open-weight models from Meta (Llama), Mistral, and others give you more control but come with infrastructure baggage.
| Track | Strengths | Weaknesses | Best For |
|---|---|---|---|
| Hosted API (frontier) | Latest capability, zero ops, fast iteration | Cost at scale, vendor dependency, data leaves your infra | Most teams starting out, complex reasoning tasks |
| Hosted API (mid-tier) | Good cost/quality ratio, same deployment simplicity | Weaker on complex tasks, less controllable | High-volume simple tasks, routing targets |
| Open-weight (large) | Data control, no per-token cost at scale, fine-tunable | GPU costs, ops burden, slower model updates | High volume, data residency, offline |
| Open-weight (small) | Fast inference, cheap, embeddable | Limited capability, more prompt engineering | Classification, extraction, edge deployment |
## What to Actually Compare
Forget leaderboards. They’re narrow, gameable, and rarely match your workload. Here are the dimensions that matter in production:
| Dimension | What to Measure | Why It Matters |
|---|---|---|
| Task fit | Success rate on your actual prompts | A model that aces coding benchmarks might fail your extraction tasks |
| Latency | p50 and p95 with realistic prompt sizes | Average latency hides tail problems that users feel |
| Cost per success | Total spend per completed task, including retries | Cheap per-token doesn’t mean cheap per-task |
| Structured output | JSON/schema compliance rate | Critical if downstream code parses the response |
| Tool use | Accuracy of function calling and parameter extraction | Bad tool calls are worse than no tool calls |
| Safety/controllability | Refusal rates, policy adherence, output consistency | Too permissive or too restrictive both cause problems |
| Context handling | Quality at 8k, 32k, 128k+ tokens | Long context support isn’t the same as long context quality |
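Structured-output compliance is the easiest of these dimensions to check mechanically. A minimal sketch: verify the response parses as JSON and contains the required top-level keys (the function and key names here are illustrative; a real harness would also validate value types against a schema):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// compliant reports whether a model response is parseable JSON
// containing every required top-level key.
func compliant(response string, requiredKeys []string) bool {
	var obj map[string]json.RawMessage
	if err := json.Unmarshal([]byte(response), &obj); err != nil {
		return false
	}
	for _, k := range requiredKeys {
		if _, ok := obj[k]; !ok {
			return false
		}
	}
	return true
}

func main() {
	keys := []string{"name", "amount"}
	fmt.Println(compliant(`{"name":"x","amount":3}`, keys)) // true
	fmt.Println(compliant(`{"name":"x"}`, keys))            // false: missing key
	fmt.Println(compliant(`not json`, keys))                // false: unparseable
}
```

Run every bake-off response through a check like this and you get the compliance rate as a single number per model.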
I’ve run these comparisons for teams I’ve worked with. The results consistently surprise people. The “best” model on paper is rarely the best model for their specific tasks.
## How to Run a Bake-Off
Don’t spend a month on this. A focused bake-off should take a few days:
- Pick 30-50 representative inputs from your actual workload. Cover the common cases and the hard cases.
- Define success criteria for each one: specific, checkable criteria, not vibes.
- Run each model against the same inputs with the same system prompt.
- Score each model by task success rate, latency, and cost.
- Check structured output compliance if you depend on it.
The results won’t be close on every dimension. One model will be cheaper. Another will be more accurate on complex tasks. A third will have better latency. That’s the point – you’re mapping the tradeoff space, not finding a winner.
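The scoring step reduces to simple aggregation. A sketch, assuming hypothetical per-run records (field names are illustrative, not from any particular harness):

```go
package main

import "fmt"

// RunResult captures one model run against one bake-off input.
type RunResult struct {
	Model     string
	Success   bool
	LatencyMs int
	CostUSD   float64
}

// Score summarizes a model's bake-off performance.
type Score struct {
	SuccessRate    float64
	CostPerSuccess float64 // total spend divided by successful tasks
}

func scoreModel(runs []RunResult) Score {
	var successes int
	var totalCost float64
	for _, r := range runs {
		totalCost += r.CostUSD
		if r.Success {
			successes++
		}
	}
	s := Score{SuccessRate: float64(successes) / float64(len(runs))}
	if successes > 0 {
		s.CostPerSuccess = totalCost / float64(successes)
	}
	return s
}

func main() {
	runs := []RunResult{
		{"model-a", true, 900, 0.02},
		{"model-a", false, 1200, 0.02},
		{"model-a", true, 800, 0.02},
		{"model-a", true, 950, 0.02},
	}
	s := scoreModel(runs)
	fmt.Printf("success=%.2f cost/success=$%.4f\n", s.SuccessRate, s.CostPerSuccess)
}
```

Note that cost-per-success charges failed attempts against the model too, which is exactly why cheap per-token doesn't mean cheap per-task.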
## The Router Pattern
Once you have bake-off data, the next step is obvious: route different task types to different models.
| Task Type | Route To | Rationale |
|---|---|---|
| Simple classification / extraction | Small or mid-tier model | High volume, accuracy is sufficient, saves 60-80% |
| Complex reasoning / generation | Frontier model | Quality matters, volume is lower |
| Structured data extraction | Model with best schema compliance | Parsing reliability is non-negotiable |
| Latency-critical | Fastest model that meets quality bar | User experience trumps marginal quality |
| Fallback | Second provider | Availability protection |
A routing layer adds complexity, but not much. An if statement or a config-driven switch is enough to start. You don’t need an ML-based router. You need a decision tree grounded in your bake-off results.
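A config-driven switch really is all it takes. A minimal sketch, with made-up task-type names and model IDs standing in for your bake-off results:

```go
package main

import "fmt"

// routes maps task types to model IDs. In practice this table would be
// loaded from config and populated from bake-off results; all names
// here are hypothetical.
var routes = map[string]string{
	"classify": "small-model",
	"extract":  "schema-strong-model",
	"generate": "frontier-model",
}

// fallbackModel is the safe default for unrecognized task types.
const fallbackModel = "frontier-model"

func pickModel(taskType string) string {
	if m, ok := routes[taskType]; ok {
		return m
	}
	return fallbackModel
}

func main() {
	fmt.Println(pickModel("classify")) // small-model
	fmt.Println(pickModel("unknown"))  // frontier-model
}
```

Keeping the table in config means you can re-point a task type at a different model after the next bake-off without a code change.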
One team I worked with went from a single frontier model to a two-model router and cut monthly spend by 60% with no measurable quality regression. The hard part was running the bake-off. The router itself was 50 lines of Go.
## Open Models: When and When Not
Self-hosting is a real option now. Llama 3 and Mistral variants are genuinely capable. But the question isn’t “can it do the task?” It’s “do we want to own the infrastructure?”
Self-host when:
- Data must not leave your network (regulatory, contractual)
- Volume is high and predictable enough that fixed GPU costs beat per-token pricing
- You need fine-tuning that hosted APIs don’t support
- You need offline or air-gapped operation
Don’t self-host when:
- Volume is bursty or growing unpredictably
- You need frontier capability that open models haven’t matched yet
- Your team doesn’t have GPU ops experience
- You want to iterate model versions quickly
I’ve talked a few teams out of self-hosting after running the numbers. The GPU costs plus ops burden plus slower iteration cycle made the total cost higher than the hosted API they were trying to replace. Self-hosting is a capability decision as much as a cost decision.
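"Running the numbers" starts with a break-even check: at what monthly token volume do fixed self-hosting costs beat hosted per-token pricing? A back-of-envelope sketch with made-up figures (plug in your own quotes):

```go
package main

import "fmt"

// breakEvenTokens returns the monthly token volume at which fixed
// self-hosting costs equal hosted per-token spend. Inputs are
// hypothetical: monthlyFixedUSD covers GPUs plus ops, and
// hostedUSDPerMTok is the hosted price per million tokens.
func breakEvenTokens(monthlyFixedUSD, hostedUSDPerMTok float64) float64 {
	return monthlyFixedUSD / hostedUSDPerMTok * 1_000_000
}

func main() {
	// e.g. $8,000/month for GPUs + ops vs $5 per million tokens hosted
	tokens := breakEvenTokens(8000, 5)
	fmt.Printf("break-even at %.0f tokens/month\n", tokens)
}
```

With those illustrative numbers you'd need sustained volume north of a billion tokens a month before self-hosting pays for itself, and that's before pricing in slower iteration and the ops burden.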
## Contracts and Pricing: Check the Fine Print
Pricing shifts fast. What I can tell you as of late 2024:
- The spread between frontier and mid-tier models is 10-30x on a per-token basis
- Total cost is dominated by usage patterns (retries, context size, output length), not headline price
- Enterprise agreements often include committed-use discounts that change the math significantly
- Rate limits and quotas vary by tier and can cap throughput during peak usage
Verify current rates directly with providers before locking in. A pricing comparison that’s two months old is already stale.
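The "usage patterns dominate" point is worth making concrete. A rough per-task cost model, all parameters hypothetical, showing how retries multiply a headline price:

```go
package main

import "fmt"

// costPerTask estimates expected spend per completed task.
// Rates are USD per million tokens; attemptsPerTask folds in retries
// (1.0 means every task succeeds on the first try).
func costPerTask(inTokens, outTokens, inRate, outRate, attemptsPerTask float64) float64 {
	perAttempt := inTokens/1e6*inRate + outTokens/1e6*outRate
	return perAttempt * attemptsPerTask
}

func main() {
	// Illustrative: a cheap model averaging 1.4 attempts per task
	// vs a pricier model succeeding first time.
	cheap := costPerTask(3000, 500, 1, 2, 1.4)
	pricey := costPerTask(3000, 500, 5, 15, 1.0)
	fmt.Printf("cheap=$%.5f pricey=$%.5f per task\n", cheap, pricey)
}
```

Even with the retry penalty, the cheap model wins here by roughly 4x per task, but a higher retry rate or a larger context would narrow that gap quickly, which is why the comparison has to use your numbers, not the price page's.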
## The Only Advice That Ages Well
There’s no universal winner. Run a focused bake-off on your tasks, build a simple router, monitor everything, and re-evaluate quarterly. The model landscape moves fast. Your selection process should be fast too.
Treat vendor claims and public benchmarks as starting points, not decisions. The evaluation set built from your actual prompts, reviewed by your team, is the only benchmark that matters.