Your LLM Bill Is Your Own Fault

| 4 min read |
ai cost-optimization llm engineering

Everyone's complaining about LLM costs. Almost nobody has done the basics: caching, model routing, or even measuring what they're spending per feature.

I got a call last week from a team that was “shocked” their OpenAI bill hit $14,000 in June. They’re a 12-person startup. I asked three questions:

  1. Do you cache any responses? No.
  2. Do you use GPT-4 for everything or route to 3.5 for simpler tasks? Everything goes to GPT-4.
  3. Do you set max_tokens on your completions? No, they “didn’t want to cut off the output.”

This isn’t a cost optimization problem. This is a “nobody thought about it for five minutes” problem.

The FinOps-for-AI grift

I’ve already seen three startups pitching “FinOps for AI” – dashboards, alerts, recommendations, the whole cloud cost management playbook repackaged for LLM spend. I have strong feelings about this.

Your LLM costs aren’t a monitoring problem. They’re an architecture problem. You don’t need a dashboard to tell you that sending 4,000-token prompts to GPT-4 for a classification task that GPT-3.5-turbo handles fine is wasteful. You need an engineer to spend an afternoon thinking about it.

I spent years in the telecom space. The cloud cost management industry exists because organizations got too big and too decoupled for anyone to own the bill. A 12-person startup doesn’t have that problem. You have one API key. Look at the usage page. It’s right there.

The stuff that actually moves the needle

Here’s what I tell every team that asks me about LLM costs. It takes about two days to implement all of it, and it usually cuts the bill by 40-70%.

Route by task complexity. This is the single biggest lever. Most LLM workloads are a mix of simple tasks (classification, extraction, formatting) and hard tasks (reasoning, creative generation, complex analysis). GPT-3.5-turbo handles the simple stuff at 1/20th the cost. Build a router. It can be as dumb as a switch statement on the task type. Doesn’t need to be fancy.
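A dumb router really can be a switch statement. Here’s a minimal sketch, assuming your task types are known at the call site; the `TaskType` enum and `pick_model` helper are illustrative names, not anyone’s real API.

```python
from enum import Enum

class TaskType(Enum):
    CLASSIFICATION = "classification"
    EXTRACTION = "extraction"
    FORMATTING = "formatting"
    REASONING = "reasoning"
    GENERATION = "generation"

# Simple, mechanical tasks go to the cheap model; everything else gets GPT-4.
CHEAP_TASKS = {TaskType.CLASSIFICATION, TaskType.EXTRACTION, TaskType.FORMATTING}

def pick_model(task: TaskType) -> str:
    return "gpt-3.5-turbo" if task in CHEAP_TASKS else "gpt-4"
```

That’s the whole router. You can graduate to a learned classifier later if the switch statement ever becomes the bottleneck; it usually doesn’t.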

Cache deterministic requests. If the same input produces the same acceptable output, cache it. At a fintech company I worked with previously, we had an endpoint that was calling the LLM to classify transaction types. The same transaction descriptions kept coming through. A Redis cache with a 24-hour TTL cut that endpoint’s LLM calls by 60% in the first week.
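The pattern is simple enough to sketch. This version uses an in-memory dict for illustration; the fintech setup was Redis with a 24-hour TTL. `call_llm` is a stand-in for your actual API call, not a real library function.

```python
import hashlib
import time

CACHE: dict[str, tuple[float, str]] = {}  # key -> (timestamp, response)
TTL_SECONDS = 24 * 60 * 60

def cache_key(model: str, prompt: str) -> str:
    # Hash the full input so keys stay small and identical inputs collide on purpose.
    return hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()

def cached_call(model: str, prompt: str, call_llm) -> str:
    key = cache_key(model, prompt)
    hit = CACHE.get(key)
    if hit and time.time() - hit[0] < TTL_SECONDS:
        return hit[1]  # cache hit: no API call, no cost
    result = call_llm(model, prompt)
    CACHE[key] = (time.time(), result)
    return result
```

With Redis you’d get the TTL for free via `SETEX` and the cache survives restarts, but the shape of the code is identical.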

Trim your prompts. Prompts grow like code comments – they accrete. Someone adds “also make sure to…” and nobody removes the instruction that became redundant three iterations ago. I’ve seen production prompts that were 2,000 tokens of instructions for a task that needed 200. Audit them quarterly. Shorter prompts are cheaper and usually produce better output.
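A quarterly audit can be a ten-line script. This sketch uses a rough ~4-characters-per-token estimate as a ballpark (use a real tokenizer like tiktoken for exact counts); the budget number is an assumption you’d tune per task.

```python
def approx_tokens(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token for English text.
    return max(1, len(text) // 4)

def audit_prompts(prompts: dict[str, str], budget: int = 500) -> list[str]:
    # Return the names of prompts that blow the token budget -- candidates for trimming.
    return [name for name, p in prompts.items() if approx_tokens(p) > budget]
```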

Set max_tokens. Always. If you expect a 50-word response, don’t let the model ramble for 500 tokens. This seems obvious. It’s apparently not obvious, because I keep seeing it.
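Picking the cap doesn’t need to be precise. A rough heuristic, assuming ~1.3 tokens per English word plus some slack (both numbers are ballpark assumptions, not provider guarantees):

```python
def token_budget(expected_words: int, slack: float = 1.5) -> int:
    # ~1.3 tokens per English word is a common rule of thumb; slack leaves
    # headroom so legitimate responses don't get truncated.
    return int(expected_words * 1.3 * slack)
```

For a 50-word answer that gives a `max_tokens` of about 97 instead of the model’s default ceiling, which is the difference between paying for what you asked for and paying for a ramble.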

Batch where possible. If you’re processing 100 items, don’t make 100 API calls. Combine them: “Classify each of the following items…” One call, one response. Works great for classification, extraction, and summarization. Doesn’t work for complex reasoning tasks, where each item needs the model’s full attention in its own context window.
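The batching itself is just string assembly plus a parser for the response. A sketch, assuming the model reliably returns one numbered answer per line (worth validating in practice; the parser raises if the counts don’t match):

```python
def build_batch_prompt(items: list[str], labels: list[str]) -> str:
    numbered = "\n".join(f"{i + 1}. {item}" for i, item in enumerate(items))
    return (
        f"Classify each of the following items as one of: {', '.join(labels)}.\n"
        f"Reply with one line per item, formatted '<number>. <label>'.\n\n{numbered}"
    )

def parse_batch_response(text: str, n_items: int) -> list[str]:
    answers = []
    for line in text.strip().splitlines():
        _, _, label = line.partition(". ")
        answers.append(label.strip())
    if len(answers) != n_items:
        # The model dropped or merged an answer -- fail loudly, don't misalign.
        raise ValueError(f"expected {n_items} answers, got {len(answers)}")
    return answers
```

One prompt, one response, one round of per-request overhead instead of a hundred.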

The cost model is simple

Your monthly LLM bill is roughly:

requests * avg_tokens_per_request * price_per_token

That’s three levers. Reduce any of them and the bill drops. The reason people’s bills surprise them is they don’t track any of these per feature. They see one big number at the end of the month and panic.

Track cost per feature. Not per API key, not per team. Per feature. “The document summarizer costs $X per day. The classification endpoint costs $Y per day.” Now you can have an actual conversation about whether the value justifies the cost.
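The tracker is a dictionary and a multiply. A sketch, assuming you tag every call with a feature name; the per-1K-token prices below are illustrative, so check your provider’s current pricing page before trusting the numbers.

```python
from collections import defaultdict

# (input, output) USD per 1K tokens -- illustrative figures, verify against
# your provider's pricing page.
PRICE_PER_1K = {
    "gpt-3.5-turbo": (0.0015, 0.002),
    "gpt-4": (0.03, 0.06),
}

SPEND: defaultdict[str, float] = defaultdict(float)

def record_call(feature: str, model: str, in_tokens: int, out_tokens: int) -> float:
    # requests * tokens * price, accumulated per feature rather than per API key.
    p_in, p_out = PRICE_PER_1K[model]
    cost = in_tokens / 1000 * p_in + out_tokens / 1000 * p_out
    SPEND[feature] += cost
    return cost
```

Dump `SPEND` once a day and you have the “document summarizer costs $X per day” conversation without buying a platform for it.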

What actually annoys me

The thing that gets under my skin is people treating LLM costs as this novel, complex problem that requires new tools and new thinking. It doesn’t. It’s the same cost engineering we’ve done forever. You measure, you identify waste, you optimize the hot paths, and you set budgets.

The only new wrinkle is that LLM costs scale with input/output volume in a way that’s more linear and more visible than traditional compute costs. That actually makes it easier to optimize, not harder. Every request has a clear cost. You don’t need to guess about amortized instance hours or reserved capacity pricing.

If your LLM bill is out of control, you don’t need a “FinOps for AI” platform. You need to spend a day implementing caching and model routing. Then spend 30 minutes a month reviewing the usage dashboard that OpenAI already gives you for free.

This isn’t rocket science. It’s engineering discipline. The same discipline that keeps your AWS bill sane keeps your LLM bill sane. We just collectively forgot it because the technology is new and shiny.