Picking an AI Model for Production (Late 2024)


There's no best model. There's the model that fits your workload, latency budget, cost constraint, and ops tolerance. Here's how to compare them.

Quick take

Benchmarks mislead. The right model depends on your tasks, latency requirements, cost tolerance, and how much ops overhead you can absorb. Run a bake-off on your actual workload. Route between models. Stop looking for a universal winner.


I get asked “which model should we use?” at least once a week. The answer is always the same: it depends. That answer always disappoints, so let me make it useful.

The late-2024 model landscape is competitive enough that the gap between top-tier providers on general tasks is small. The differences that matter most are operational, not intellectual. Here’s how I think about model selection for production systems.

The Landscape at a Glance

Two tracks dominate. Hosted APIs from Anthropic, OpenAI, and Google iterate fast and are easiest to ship with. Open-weight models from Meta (Llama), Mistral, and others give you more control but come with infrastructure baggage.

| Track | Strengths | Weaknesses | Best For |
| --- | --- | --- | --- |
| Hosted API (frontier) | Latest capability, zero ops, fast iteration | Cost at scale, vendor dependency, data leaves your infra | Most teams starting out, complex reasoning tasks |
| Hosted API (mid-tier) | Good cost/quality ratio, same deployment simplicity | Weaker on complex tasks, less controllable | High-volume simple tasks, routing targets |
| Open-weight (large) | Data control, no per-token cost at scale, fine-tunable | GPU costs, ops burden, slower model updates | High volume, data residency, offline |
| Open-weight (small) | Fast inference, cheap, embeddable | Limited capability, more prompt engineering | Classification, extraction, edge deployment |

What to Actually Compare

Forget leaderboards. They’re narrow, gameable, and rarely match your workload. Here are the dimensions that matter in production:

| Dimension | What to Measure | Why It Matters |
| --- | --- | --- |
| Task fit | Success rate on your actual prompts | A model that aces coding benchmarks might fail your extraction tasks |
| Latency | p50 and p95 with realistic prompt sizes | Average latency hides tail problems that users feel |
| Cost per success | Total spend per completed task, including retries | Cheap per-token doesn’t mean cheap per-task |
| Structured output | JSON/schema compliance rate | Critical if downstream code parses the response |
| Tool use | Accuracy of function calling and parameter extraction | Bad tool calls are worse than no tool calls |
| Safety/controllability | Refusal rates, policy adherence, output consistency | Too permissive or too restrictive both cause problems |
| Context handling | Quality at 8k, 32k, 128k+ tokens | Long context support isn’t the same as long context quality |

I’ve run these comparisons for teams I’ve worked with. The results consistently surprise people. The “best” model on paper is rarely the best model for their specific tasks.
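The latency row is worth making concrete, because it’s the one teams most often measure wrong. A small sketch (nearest-rank percentiles over plain Python lists, with made-up latency numbers) of why the mean misleads:

```python
# Sketch: why mean latency hides tail problems.
# Nearest-rank percentile over a list of per-request latencies (ms).

def percentile(samples: list, p: float) -> float:
    """Nearest-rank percentile for p in [0, 100]."""
    ranked = sorted(samples)
    k = max(0, min(len(ranked) - 1, round(p / 100 * len(ranked)) - 1))
    return ranked[k]

# Hypothetical latencies: most requests are fast, a few are very slow.
latencies = [120, 130, 125, 140, 135, 128, 122, 131, 2400, 2600]

mean = sum(latencies) / len(latencies)
p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)

print(f"mean={mean:.0f}ms  p50={p50:.0f}ms  p95={p95:.0f}ms")
# mean=603ms  p50=130ms  p95=2600ms
```

The mean misleads in both directions here: it overstates what the median user sees and understates what the unlucky 5% see. That’s why the table asks for p50 and p95, not an average.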

How to Run a Bake-Off

Don’t spend a month on this. A focused bake-off should take a few days:

  1. Pick 30-50 representative inputs from your actual workload. Cover the common cases and the hard cases.
  2. Define success criteria for each one. Not vibes: specific, checkable criteria.
  3. Run each model against the same inputs with the same system prompt.
  4. Score each model by task success rate, latency, and cost.
  5. Check structured output compliance if you depend on it.

The results won’t line up across dimensions. One model will be cheaper. Another will be more accurate on complex tasks. A third will have better latency. That’s the point: you’re mapping the tradeoff space, not finding a winner.

The Router Pattern

Once you have bake-off data, the next step is obvious: route different task types to different models.

| Task Type | Route To | Rationale |
| --- | --- | --- |
| Simple classification / extraction | Small or mid-tier model | High volume, accuracy is sufficient, saves 60-80% |
| Complex reasoning / generation | Frontier model | Quality matters, volume is lower |
| Structured data extraction | Model with best schema compliance | Parsing reliability is non-negotiable |
| Latency-critical | Fastest model that meets quality bar | User experience trumps marginal quality |
| Fallback | Second provider | Availability protection |
A routing layer adds complexity, but not much. An if statement or a config-driven switch is enough to start. You don’t need an ML-based router. You need a decision tree grounded in your bake-off results.

One team I worked with went from a single frontier model to a two-model router and cut monthly spend by 60% with no measurable quality regression. The hard part was running the bake-off. The router itself was 50 lines of Go.
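That team’s router was Go; an equivalent config-driven sketch in Python is a dictionary lookup with a fallback. Task names and model IDs below are illustrative, not real identifiers — the mapping should come from your bake-off results.

```python
# Config-driven router sketch. Task types and model names are made up;
# populate the mapping from your own bake-off data.
ROUTES = {
    "classification": "mid-tier-model",
    "extraction":     "mid-tier-model",
    "reasoning":      "frontier-model",
    "generation":     "frontier-model",
}
FALLBACK = "second-provider-model"   # availability protection

def route(task_type: str) -> str:
    """Pick a model for a task type; unknown types go to the fallback."""
    return ROUTES.get(task_type, FALLBACK)

print(route("classification"))   # mid-tier-model
print(route("unknown-task"))     # second-provider-model
```

The design choice worth noting: the routing table is data, not code, so swapping a model after a re-evaluation is a config change rather than a deploy.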

Open Models: When and When Not

Self-hosting is a real option now. Llama 3 and Mistral variants are genuinely capable. But the question isn’t “can it do the task?” It’s “do we want to own the infrastructure?”

Self-host when:

  • Data must not leave your network (regulatory, contractual)
  • Volume is high and predictable enough that fixed GPU costs beat per-token pricing
  • You need fine-tuning that hosted APIs don’t support
  • You need offline or air-gapped operation

Don’t self-host when:

  • Volume is bursty or growing unpredictably
  • You need frontier capability that open models haven’t matched yet
  • Your team doesn’t have GPU ops experience
  • You want to iterate model versions quickly

I’ve talked a few teams out of self-hosting after running the numbers. The GPU costs plus ops burden plus slower iteration cycle made the total cost higher than the hosted API they were trying to replace. Self-hosting is a capability decision as much as a cost decision.
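"Running the numbers" is simple arithmetic. A sketch with entirely made-up figures (the GPU cost and API price below are assumptions for illustration, not quotes):

```python
# Self-hosting breakeven sketch. All prices are illustrative assumptions.

def breakeven_tokens_per_month(gpu_monthly_usd: float,
                               api_price_per_mtok_usd: float) -> float:
    """Monthly token volume at which fixed GPU cost equals API spend."""
    return gpu_monthly_usd / api_price_per_mtok_usd * 1_000_000

# Assumed figures: $6,000/month for a GPU node (hardware + ops labor),
# $1.50 per million tokens on a hosted mid-tier API.
tokens = breakeven_tokens_per_month(6000, 1.50)
print(f"breakeven: {tokens / 1e9:.1f}B tokens/month")   # breakeven: 4.0B tokens/month
```

Below that volume, the hosted API wins on cost alone, before you count the slower iteration cycle. Above it, the fixed-cost math starts to favor self-hosting, provided the volume is steady.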

Contracts and Pricing: Check the Fine Print

Pricing shifts fast. What I can tell you as of late 2024:

  • The spread between frontier and mid-tier models is 10-30x on a per-token basis
  • Total cost is dominated by usage patterns (retries, context size, output length), not headline price
  • Enterprise agreements often include committed-use discounts that change the math significantly
  • Rate limits and quotas vary by tier and can cap throughput during peak usage

Verify current rates directly with providers before locking in. A pricing comparison that’s two months old is already stale.
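The "usage patterns dominate" point can be made with arithmetic. Every number below is an assumption chosen for illustration, not a provider's actual price:

```python
# Cost-per-task sketch: cheap per-token is not cheap per-task once
# retries and context size are counted. All prices are assumptions.

def cost_per_task(input_tokens: int, output_tokens: int,
                  in_price_per_mtok: float, out_price_per_mtok: float,
                  retry_rate: float) -> float:
    """Expected USD per completed task, counting retried attempts."""
    per_attempt = (input_tokens * in_price_per_mtok
                   + output_tokens * out_price_per_mtok) / 1_000_000
    return per_attempt * (1 + retry_rate)

# Cheap model: low per-token price, but long few-shot prompts and
# frequent retries to hit the quality bar.
cheap = cost_per_task(8000, 500, 0.50, 1.50, retry_rate=0.40)
# Pricier model: 6x the input price, but shorter prompts and rare retries.
pricey = cost_per_task(2000, 500, 3.00, 15.00, retry_rate=0.05)
print(f"cheap-per-token: ${cheap:.5f}/task  pricier: ${pricey:.5f}/task")
```

With these assumed numbers, a 6x per-token spread shrinks to roughly 2x per completed task. Different retry rates or prompt lengths can close the gap further, which is exactly why the headline price alone can’t drive the decision.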

The Only Advice That Ages Well

There’s no universal winner. Run a focused bake-off on your tasks, build a simple router, monitor everything, and re-evaluate quarterly. The model landscape moves fast. Your selection process should be fast too.

Treat vendor claims and public benchmarks as starting points, not decisions. The evaluation set built from your actual prompts, reviewed by your team, is the only benchmark that matters.