Let me tell you about a fun morning I had last month. A major model provider had a partial outage. Not full downtime – worse. Elevated latency and intermittent 500s that made the retry logic work overtime without actually resolving anything. The team had bet everything on that one provider. Their AI features were effectively down for four hours.
Another team, running a multi-model setup, barely noticed. Their routing layer shifted traffic to the fallback model within seconds. Quality dipped slightly on complex tasks. Users didn’t complain.
Guess which architecture I recommend now.
The case is boring, and that’s the point
Multi-model isn’t about chasing the latest release or playing model arbitrage. It’s about the same boring infrastructure principles we’ve applied to databases, CDNs, and DNS for decades. Don’t have a single point of failure. Don’t lock yourself into one vendor. Have a plan for when things break.
With LLMs, the failure modes are broader than traditional services. A provider can go down entirely. Latency can spike. A model update can silently change behavior. Rate limits can throttle you during a traffic spike. Any of these will degrade your product if you have no alternative path.
How I think about routing
Routing doesn’t need to be sophisticated. I’ve seen teams over-engineer this with ML-powered classifiers that decide which model gets each request. That’s fun to build and painful to debug.
What works: simple rules based on task type and complexity.
Short classification tasks? Small, fast model. Interactive chat with a paying user? Mid-tier model with good latency. Complex analysis that needs deep reasoning? Big model. Fallback on timeout or error? The next model in the chain.
You can express this in a config file:
```yaml
routing:
  default: "sonnet"
  rules:
    - task: "classify"
      model: "haiku"
    - task: "analyze"
      complexity: "high"
      model: "opus"
  fallback_chain: ["sonnet", "haiku"]
  timeout_ms: 10000
```
That’s it. No neural router. No reinforcement learning. Just explicit rules you can read, debug, and change in five minutes.
The key insight: routing is configuration, not code. When a new model drops or pricing changes, you update the config. You don’t refactor a service.
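The lookup itself stays tiny. Here's a sketch of what reading that config from application code can look like – the model names and rule fields mirror the YAML example above, and everything here is illustrative, not a real provider SDK:

```python
# Illustrative config-driven router. In practice ROUTING would be
# parsed from the YAML file; it's inlined here to keep the sketch
# self-contained.
ROUTING = {
    "default": "sonnet",
    "rules": [
        {"task": "classify", "model": "haiku"},
        {"task": "analyze", "complexity": "high", "model": "opus"},
    ],
    "fallback_chain": ["sonnet", "haiku"],
    "timeout_ms": 10_000,
}

def pick_model(task: str, complexity: str = "normal") -> str:
    """Return the first rule that matches the request, else the default."""
    for rule in ROUTING["rules"]:
        if rule["task"] != task:
            continue
        if "complexity" in rule and rule["complexity"] != complexity:
            continue
        return rule["model"]
    return ROUTING["default"]

print(pick_model("classify"))          # haiku
print(pick_model("analyze", "high"))   # opus
print(pick_model("chat"))              # sonnet (default)
```

Swapping a model is a one-line config change; the lookup code never moves.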
The fallback chain is everything
I can’t stress this enough. Your fallback chain is more important than your primary model choice. Because the primary model will be unavailable at some point.
Keep the chain short – two or three models. Set aggressive timeouts. And critically: log which model actually served each request. If you don’t, you have no idea what quality your users are actually getting. You think they’re getting Opus but half the traffic is silently falling back to Haiku because of rate limits.
I made this mistake early on in a project at a telecom company. We had a fallback in place but no logging on which model served the request. For two weeks, the primary model was rate-limited during peak hours and the fallback was handling 40% of traffic. We didn’t notice until a quality review showed unexpected patterns. Now I log every routing decision. Non-negotiable.
Cost management as a feature
Multi-model is also the most effective cost control mechanism I’ve found. Instead of running every request through the most capable (and expensive) model, you match model capability to task complexity.
The math is straightforward. If 60% of your requests are simple enough for a small model at one-tenth the cost per token, you just cut your AI spend by roughly half. That’s real money at scale. Working with larger companies always surfaces this – teams are shocked when they see how much they’re spending on GPT-4 for tasks that a 7B model could handle.
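Spelled out with the numbers from that example (relative per-token costs, using the one-tenth ratio above):

```python
# Back-of-the-envelope for the 60/40 split described in the text.
# Costs are relative: big model = 1.0 per token, small model = 0.1.
simple_share, complex_share = 0.60, 0.40
small_cost, big_cost = 0.1, 1.0

blended = simple_share * small_cost + complex_share * big_cost
savings = 1.0 - blended
print(f"blended cost: {blended:.2f}x, savings: {savings:.0%}")
# blended cost: 0.46x, savings: 54%
```

A 54% reduction – "roughly half" – from nothing more than matching model to task.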
What goes wrong
Three failure modes I see repeatedly:
Silent fallbacks. The system falls back gracefully, but nobody knows. Quality degrades slowly. Users get frustrated. By the time someone investigates, there are weeks of bad data.
Stale routing rules. A rule made sense three months ago when Model X was the best at coding tasks. Now Model Y is better and cheaper. But nobody updated the config because nobody owns it.
No cross-model evaluation. Teams evaluate their primary model carefully and treat the fallback as “good enough.” Then the fallback serves 30% of traffic during a bad week and nobody has measured whether it’s actually good enough for those tasks.
The fix for all three is the same: monitor, measure, review. Log every routing decision. Run evals against every model in your chain. Review the routing config monthly. This isn’t exciting work. It’s the work that keeps production systems stable.
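The cross-model part of that is mechanical once you commit to it: run the same eval set against every model in the chain, not just the primary. A minimal sketch – the model names, the eval cases, and the `grade` callback are all stand-ins for your real harness:

```python
# Score every model in the fallback chain on one shared eval set.
CHAIN = ["opus", "sonnet", "haiku"]  # illustrative names

EVAL_SET = [
    {"prompt": "Classify: 'refund please'", "expected": "billing"},
    {"prompt": "Classify: 'app crashes on login'", "expected": "bug"},
]

def eval_chain(grade) -> dict:
    """Return a pass rate per model.

    `grade(model, case)` is your harness: it calls the model and
    returns True if the output matches `case["expected"]`.
    """
    scores = {}
    for model in CHAIN:
        passed = sum(bool(grade(model, case)) for case in EVAL_SET)
        scores[model] = passed / len(EVAL_SET)
    return scores
```

If the fallback's score on this set would be unacceptable at 30% of traffic, you want to know before the bad week, not after.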
Keep it simple
Multi-model doesn’t mean complex. It means intentional. Pick two or three models that cover your cost and capability range. Write routing rules you can read. Log everything. Measure quality per model. Review monthly.
The teams shipping reliable AI features aren’t the ones with the cleverest model selection algorithm. They’re the ones that can swap a model in five minutes, measure the impact in an hour, and roll back in seconds.
That’s the whole strategy. Boring, effective, resilient.