## Quick take
Benchmarks mislead. The right model depends on your tasks, latency requirements, cost tolerance, and how much ops overhead you can absorb. Run a bake-off on your actual workload. Route between models. Stop looking for a universal winner.
I get asked “which model should we use?” at least once a week. The answer is always the same: it depends. That answer always disappoints, so let me make it useful.
The late-2024 model landscape is competitive enough that the gap between top-tier providers on general tasks is small. The differences that matter most are operational, not intellectual. Here’s how I think about model selection for production systems.
## The Landscape at a Glance
Two tracks dominate. Hosted APIs from Anthropic, OpenAI, and Google iterate fast and are easiest to ship with. Open-weight models from Meta (Llama), Mistral, and others give you more control but come with infrastructure baggage.
| Track | Strengths | Weaknesses | Best For |
|---|---|---|---|
| Hosted API (frontier) | Latest capability, zero ops, fast iteration | Cost at scale, vendor dependency, data leaves your infra | Most teams starting out, complex reasoning tasks |
| Hosted API (mid-tier) | Good cost/quality ratio, same deployment simplicity | Weaker on complex tasks, less controllable | High-volume simple tasks, routing targets |
| Open-weight (large) | Data control, no per-token cost at scale, fine-tunable | GPU costs, ops burden, slower model updates | High volume, data residency, offline |
| Open-weight (small) | Fast inference, cheap, embeddable | Limited capability, more prompt engineering | Classification, extraction, edge deployment |
## What to Actually Compare
Forget leaderboards. They’re narrow, gameable, and rarely match your workload. Here are the dimensions that matter in production:
| Dimension | What to Measure | Why It Matters |
|---|---|---|
| Task fit | Success rate on your actual prompts | A model that aces coding benchmarks might fail your extraction tasks |
| Latency | p50 and p95 with realistic prompt sizes | Average latency hides tail problems that users feel |
| Cost per success | Total spend per completed task, including retries | Cheap per-token doesn’t mean cheap per-task |
| Structured output | JSON/schema compliance rate | Critical if downstream code parses the response |
| Tool use | Accuracy of function calling and parameter extraction | Bad tool calls are worse than no tool calls |
| Safety/controllability | Refusal rates, policy adherence, output consistency | Too permissive or too restrictive both cause problems |
| Context handling | Quality at 8k, 32k, 128k+ tokens | Long context support isn’t the same as long context quality |
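Structured-output compliance is the easiest of these dimensions to check mechanically. A minimal sketch: verify the response parses as JSON and contains the required top-level keys (the function and key names here are illustrative; a real harness would also validate value types against a schema):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// compliant reports whether a model response is parseable JSON
// containing every required top-level key.
func compliant(response string, requiredKeys []string) bool {
	var obj map[string]json.RawMessage
	if err := json.Unmarshal([]byte(response), &obj); err != nil {
		return false
	}
	for _, k := range requiredKeys {
		if _, ok := obj[k]; !ok {
			return false
		}
	}
	return true
}

func main() {
	keys := []string{"name", "amount"}
	fmt.Println(compliant(`{"name":"x","amount":3}`, keys)) // true
	fmt.Println(compliant(`{"name":"x"}`, keys))            // false: missing key
	fmt.Println(compliant(`not json`, keys))                // false: unparseable
}
```

Run every bake-off response through a check like this and you get the compliance rate as a single number per model.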
I’ve run these comparisons for teams I’ve worked with. The results consistently surprise people. The “best” model on paper is rarely the best model for their specific tasks.
## How to Run a Bake-Off
Don’t spend a month on this. A focused bake-off should take a few days:
- Pick 30-50 representative inputs from your actual workload. Cover the common cases and the hard cases.
- Define success criteria for each one: specific, checkable criteria, not vibes.
- Run each model against the same inputs with the same system prompt.
- Score each model by task success rate, latency, and cost.
- Check structured output compliance if you depend on it.
The results won’t be close on every dimension. One model will be cheaper. Another will be more accurate on complex tasks. A third will have better latency. That’s the point – you’re mapping the tradeoff space, not finding a winner.
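The scoring step reduces to simple aggregation. A sketch, assuming hypothetical per-run records (field names are illustrative, not from any particular harness):

```go
package main

import "fmt"

// RunResult captures one model run against one bake-off input.
type RunResult struct {
	Model     string
	Success   bool
	LatencyMs int
	CostUSD   float64
}

// Score summarizes a model's bake-off performance.
type Score struct {
	SuccessRate    float64
	CostPerSuccess float64 // total spend divided by successful tasks
}

func scoreModel(runs []RunResult) Score {
	var successes int
	var totalCost float64
	for _, r := range runs {
		totalCost += r.CostUSD
		if r.Success {
			successes++
		}
	}
	s := Score{SuccessRate: float64(successes) / float64(len(runs))}
	if successes > 0 {
		s.CostPerSuccess = totalCost / float64(successes)
	}
	return s
}

func main() {
	runs := []RunResult{
		{"model-a", true, 900, 0.02},
		{"model-a", false, 1200, 0.02},
		{"model-a", true, 800, 0.02},
		{"model-a", true, 950, 0.02},
	}
	s := scoreModel(runs)
	fmt.Printf("success=%.2f cost/success=$%.4f\n", s.SuccessRate, s.CostPerSuccess)
}
```

Note that cost-per-success charges failed attempts against the model too, which is exactly why cheap per-token doesn't mean cheap per-task.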
## The Router Pattern
Once you have bake-off data, the next step is obvious: route different task types to different models.
| Task Type | Route To | Rationale |
|---|---|---|
| Simple classification / extraction | Small or mid-tier model | High volume, accuracy is sufficient, saves 60-80% |
| Complex reasoning / generation | Frontier model | Quality matters, volume is lower |
| Structured data extraction | Model with best schema compliance | Parsing reliability is non-negotiable |
| Latency-critical | Fastest model that meets quality bar | User experience trumps marginal quality |
| Fallback | Second provider | Availability protection |
A routing layer adds complexity, but not much. An if statement or a config-driven switch is enough to start. You don’t need an ML-based router. You need a decision tree grounded in your bake-off results.
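A config-driven switch really is all it takes. A minimal sketch, with made-up task-type names and model IDs standing in for your bake-off results:

```go
package main

import "fmt"

// routes maps task types to model IDs. In practice this table would be
// loaded from config and populated from bake-off results; all names
// here are hypothetical.
var routes = map[string]string{
	"classify": "small-model",
	"extract":  "schema-strong-model",
	"generate": "frontier-model",
}

// fallbackModel is the safe default for unrecognized task types.
const fallbackModel = "frontier-model"

func pickModel(taskType string) string {
	if m, ok := routes[taskType]; ok {
		return m
	}
	return fallbackModel
}

func main() {
	fmt.Println(pickModel("classify")) // small-model
	fmt.Println(pickModel("unknown"))  // frontier-model
}
```

Keeping the table in config means you can re-point a task type at a different model after the next bake-off without a code change.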
One team I worked with went from a single frontier model to a two-model router and cut monthly spend by 60% with no measurable quality regression. The hard part was running the bake-off. The router itself was 50 lines of Go.
## Open Models: When and When Not
Self-hosting is a real option now. Llama 3 and Mistral variants are genuinely capable. But the question isn’t “can it do the task?” It’s “do we want to own the infrastructure?”
Self-host when:
- Data must not leave your network (regulatory, contractual)
- Volume is high and predictable enough that fixed GPU costs beat per-token pricing
- You need fine-tuning that hosted APIs don’t support
- You need offline or air-gapped operation
Don’t self-host when:
- Volume is bursty or growing unpredictably
- You need frontier capability that open models haven’t matched yet
- Your team doesn’t have GPU ops experience
- You want to iterate model versions quickly
I’ve talked a few teams out of self-hosting after running the numbers. The GPU costs plus ops burden plus slower iteration cycle made the total cost higher than the hosted API they were trying to replace. Self-hosting is a capability decision as much as a cost decision.
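"Running the numbers" starts with a break-even check: at what monthly token volume do fixed self-hosting costs beat hosted per-token pricing? A back-of-envelope sketch with made-up figures (plug in your own quotes):

```go
package main

import "fmt"

// breakEvenTokens returns the monthly token volume at which fixed
// self-hosting costs equal hosted per-token spend. Inputs are
// hypothetical: monthlyFixedUSD covers GPUs plus ops, and
// hostedUSDPerMTok is the hosted price per million tokens.
func breakEvenTokens(monthlyFixedUSD, hostedUSDPerMTok float64) float64 {
	return monthlyFixedUSD / hostedUSDPerMTok * 1_000_000
}

func main() {
	// e.g. $8,000/month for GPUs + ops vs $5 per million tokens hosted
	tokens := breakEvenTokens(8000, 5)
	fmt.Printf("break-even at %.0f tokens/month\n", tokens)
}
```

With those illustrative numbers you'd need sustained volume north of a billion tokens a month before self-hosting pays for itself, and that's before pricing in slower iteration and the ops burden.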
## Contracts and Pricing: Check the Fine Print
Pricing shifts fast. What I can tell you as of late 2024:
- The spread between frontier and mid-tier models is 10-30x on a per-token basis
- Total cost is dominated by usage patterns (retries, context size, output length), not headline price
- Enterprise agreements often include committed-use discounts that change the math significantly
- Rate limits and quotas vary by tier and can cap throughput during peak usage
Verify current rates directly with providers before locking in. A pricing comparison that’s two months old is already stale.
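The "usage patterns dominate" point is worth making concrete. A rough per-task cost model, all parameters hypothetical, showing how retries multiply a headline price:

```go
package main

import "fmt"

// costPerTask estimates expected spend per completed task.
// Rates are USD per million tokens; attemptsPerTask folds in retries
// (1.0 means every task succeeds on the first try).
func costPerTask(inTokens, outTokens, inRate, outRate, attemptsPerTask float64) float64 {
	perAttempt := inTokens/1e6*inRate + outTokens/1e6*outRate
	return perAttempt * attemptsPerTask
}

func main() {
	// Illustrative: a cheap model averaging 1.4 attempts per task
	// vs a pricier model succeeding first time.
	cheap := costPerTask(3000, 500, 1, 2, 1.4)
	pricey := costPerTask(3000, 500, 5, 15, 1.0)
	fmt.Printf("cheap=$%.5f pricey=$%.5f per task\n", cheap, pricey)
}
```

Even with the retry penalty, the cheap model wins here by roughly 4x per task, but a higher retry rate or a larger context would narrow that gap quickly, which is why the comparison has to use your numbers, not the price page's.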
## The Only Advice That Ages Well
There’s no universal winner. Run a focused bake-off on your tasks, build a simple router, monitor everything, and re-evaluate quarterly. The model landscape moves fast. Your selection process should be fast too.
Treat vendor claims and public benchmarks as starting points, not decisions. The evaluation set built from your actual prompts, reviewed by your team, is the only benchmark that matters.