The Best Model Is the Smallest One That Works

3 min read
small-models llm ai efficiency

Everyone reaches for GPT-4 by default. Most production tasks don't need it. Small models are faster, cheaper, and often better when the task is well-defined.

The default instinct when building with LLMs is to reach for the biggest model available. I get it. When you don’t know exactly what you need, the biggest model feels like the safest bet. But “safest bet” and “right choice” are not the same thing.

Most production LLM tasks I see are classification, extraction, formatting, and short generation. Intent routing for a support bot. Extracting structured data from emails. Labeling inbound requests. These don’t need GPT-4 or Claude Opus. They need a model that’s fast, cheap, and predictable.

A small model running a well-scoped task will beat a large model running a vague one. Every time.

Where small wins

Small models shine when the output space is narrow and the success criteria are clear. If you can describe the correct answer format in one sentence, a small model can probably handle it: classification with a fixed label set, entity extraction with a defined schema, or reformatting text from one structure to another.
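A fixed-label classifier is the simplest case. A minimal sketch, where `call_model` is a hypothetical stand-in for whatever client your stack uses — the key move is constraining the output space so anything off-list falls back to a safe default:

```python
# LABELS and call_model are illustrative, not from any particular API.
LABELS = {"billing", "technical", "account", "cancellation", "other"}

def classify_intent(text: str, call_model) -> str:
    """Fixed-label classification: the kind of task a small model handles well."""
    prompt = (
        "Classify the support request into exactly one label from: "
        + ", ".join(sorted(LABELS))
        + ".\nRespond with the label only.\n\nRequest: "
        + text
    )
    label = call_model(prompt).strip().lower()
    # Constrain the output space: anything off-list becomes "other".
    return label if label in LABELS else "other"
```

If you can write this function, the task is probably small-model territory.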

The advantages are not marginal. A Haiku-class model might respond in 200ms at a fraction of a cent per request. The same task on a frontier model might take 2 seconds and cost 10x more. At scale, that difference is the gap between a sustainable product and one that burns through runway.
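The arithmetic is worth doing explicitly. A back-of-envelope sketch with illustrative per-request prices (not real vendor pricing — just the 10x ratio from above):

```python
def monthly_cost(requests_per_day: int, cost_per_request: float) -> float:
    """Back-of-envelope monthly inference cost, assuming a 30-day month."""
    return requests_per_day * cost_per_request * 30

# Hypothetical numbers: a small model at $0.0005/request vs a
# frontier model at $0.005/request, at 100K requests/day.
small = monthly_cost(100_000, 0.0005)  # roughly $1,500/month
large = monthly_cost(100_000, 0.005)   # roughly $15,000/month
```

At this volume the model choice is a five-figure line item, before latency even enters the picture.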

I switched an intent router from GPT-4 to a small model last month. Accuracy stayed within 1%. Latency dropped 80%. Monthly inference cost dropped from $12K to under $2K. The engineering effort was two days of prompt tuning and evaluation.

Where small fails

Small models fall apart when the task requires multi-step reasoning, nuanced judgment, or long-form coherence. If you need a model to read a 10-page contract and identify three specific risks, it will miss things. If you need it to write a persuasive email that matches a specific executive’s tone, it will usually produce something generic.

The failure mode is subtle. Small models don’t refuse – they confidently produce mediocre output. You won’t see errors. You’ll see output that’s 80% right and 20% subtly wrong in ways that are hard to catch without careful evaluation.

The routing pattern

The most cost-effective architecture I’ve built is a two-tier system. The small model handles the 90% of requests that are well-scoped and predictable. The large model handles the 10% that need depth.

Route by complexity, not by topic. A billing question that maps to one of five categories goes to the small model. A billing dispute that requires reading context and making a judgment call goes to the large model. The router itself can be a small model – it’s just a classification task.
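The whole pattern fits in a few lines. A sketch, assuming `route_fn` is your small-model complexity classifier and the two handlers wrap whatever model calls you already have:

```python
# route_fn, small_handler, and large_handler are hypothetical stand-ins
# for your own classifier and model-calling functions.
def handle_request(text: str, route_fn, small_handler, large_handler) -> str:
    """Two-tier routing: predictable requests go small, judgment calls go large."""
    tier = route_fn(text)  # expected to return "simple" or "complex"
    if tier == "simple":
        return small_handler(text)
    # Default to the large model when the router is unsure or says "complex".
    return large_handler(text)
```

Note the failure direction: anything the router can't confidently call "simple" escalates, so routing mistakes cost money rather than quality.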

This is not novel. It’s the same pattern as having junior engineers handle routine tickets and escalating to seniors: the structure is the same, and so are the economics. Route smart, spend less.

Pick the smallest model that clears the bar

Don’t start with the biggest model and optimize later. Start with the smallest model and prove it’s insufficient before upgrading. You’ll be surprised how often “insufficient” never arrives.
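"Prove it's insufficient" means measuring, not guessing. A minimal sketch of the smallest-first selection loop, assuming you have a labeled eval set and a hypothetical `run_model(model_name, text)` function:

```python
def accuracy(model_name, eval_set, run_model) -> float:
    """Fraction of eval examples the model gets exactly right."""
    correct = sum(run_model(model_name, text) == label for text, label in eval_set)
    return correct / len(eval_set)

def smallest_that_clears(models_smallest_first, eval_set, run_model, bar=0.95):
    """Walk models from smallest to largest; return the first that meets the bar."""
    for name in models_smallest_first:
        if accuracy(name, eval_set, run_model) >= bar:
            return name
    return None  # nothing cleared the bar; revisit the task scope, not just the model
```

The quality bar and the 0.95 default are yours to set per task; the point is that upgrading requires a failed measurement, not a hunch.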

The best model isn’t the smartest one. It’s the smallest one that meets your quality bar, at a cost and latency you can sustain.