I’m tired of seeing teams dump entire documents into a context window because “it supports 128K tokens now,” then wonder why the model ignores their instructions. A bigger window isn’t a bigger brain. It’s a bigger inbox. And like any inbox, when you fill it with noise, important things get lost.
This is a rant. But it’s a rant with actionable advice.
The “just throw it all in” fallacy
Here’s what I keep seeing: a team builds a RAG pipeline that retrieves 20 document chunks for every query. They concatenate everything into the prompt because “more context is better.” The model now has 80K tokens of input, 60K of them irrelevant. The response is slower, more expensive, and (this is the part that kills me) lower quality than if they had sent 5K tokens of relevant context.
Retrieval isn’t free just because the window is big enough to hold it. Every irrelevant token dilutes the signal. The model has to figure out which parts of the context actually matter, and it isn’t always good at that, especially when the relevant information is sandwiched between walls of noise.
I reviewed a system where they were spending $400/day on inference. We cut their context budget by 70%, and quality went up. Not down. Up. The model could finally see the signal instead of drowning in noise.
Budget your context like you budget your infrastructure
You wouldn’t provision 10x the compute you need and call it a day. Don’t do it with context either.
Set a hard budget per request. Something like:
- System prompt: 1-2K tokens (this should be stable and tight)
- Retrieved context: 3-5K tokens max (be aggressive about relevance filtering)
- Conversation history: 2-4K tokens (recent turns verbatim, older turns summarized)
- Reserve: 1K tokens (for the model’s response and any overhead)
That’s 7-12K tokens for most requests. Not 128K. Not even close. And for 90% of production use cases, that’s more than enough.
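A budget like this can be enforced mechanically. Here’s a minimal sketch; the section names, the token limits, and `count_tokens` are illustrative assumptions (swap in your model’s real tokenizer and a smarter truncation strategy in production):

```python
# Illustrative per-request context budget; numbers match the list above.
BUDGET = {
    "system": 2_000,
    "retrieved": 5_000,
    "history": 4_000,
    "reserve": 1_000,  # headroom for the response and overhead
}

def count_tokens(text: str) -> int:
    # Crude approximation: ~4 characters per token.
    # Use your model's actual tokenizer in production.
    return max(1, len(text) // 4)

def enforce_budget(sections: dict[str, str]) -> dict[str, str]:
    """Truncate each section to its token budget (naive tail-truncation)."""
    trimmed = {}
    for name, text in sections.items():
        limit = BUDGET.get(name, 0)
        if count_tokens(text) > limit:
            # Keep roughly the first `limit` tokens' worth of characters.
            text = text[: limit * 4]
        trimmed[name] = text
    return trimmed
```

The point isn’t this exact code; it’s that a hard cap exists somewhere in the pipeline instead of in someone’s head.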
Teams using 128K tokens per request are either doing something genuinely complex (document analysis, long-form generation) or being lazy. Mostly the latter.
Anchors: the stuff that must never fall out
Some information is non-negotiable. The user’s permissions. The current task definition. Key constraints. Explicit decisions made earlier in the conversation. I call these “anchors.”
Anchors go at the top of the context, every time. They don’t get summarized. They don’t get rotated out. They’re the ground truth that the model needs to respect regardless of how long the conversation gets.
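In code, that means context assembly pins anchors first, verbatim, before anything that gets summarized or rotated. A sketch, where the `Anchor` type and the section ordering are my assumptions about one reasonable layout, not a fixed API:

```python
from dataclasses import dataclass

@dataclass
class Anchor:
    label: str  # e.g. "permissions", "task", "decision"
    text: str   # stated verbatim, never summarized

def build_context(anchors: list[Anchor], summary: str,
                  recent_turns: list[str], retrieved: str) -> str:
    # Anchors always come first and are never compressed.
    parts = ["## Ground truth (do not override)"]
    parts += [f"- [{a.label}] {a.text}" for a in anchors]
    if summary:
        parts += ["## Earlier conversation (summary)", summary]
    if retrieved:
        parts += ["## Retrieved context", retrieved]
    parts += ["## Recent turns"] + recent_turns
    return "\n".join(parts)
```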
I’ve debugged conversations where the model contradicted an earlier decision because the decision was in a turn that got summarized away. The summary said “the user chose option A” but the model treated it as a suggestion, not a commitment. Anchors prevent this.
Summaries need maintenance
Speaking of summaries: if you’re compressing conversation history into summaries, you need to refresh them. A summary generated 20 turns ago may be inaccurate or incomplete relative to the current state of the conversation.
The pattern I use is simple: keep the last 3-5 turns verbatim. Everything before that gets summarized. Refresh the summary every 10 turns or whenever a significant decision changes. It’s a small amount of extra work, and it prevents a category of bugs that’s extremely difficult to diagnose.
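The pattern above can be sketched as a small history manager. `summarize` here is a placeholder for an LLM call; the turn counts follow the numbers in the text:

```python
VERBATIM_TURNS = 5   # recent turns kept word-for-word
REFRESH_EVERY = 10   # refresh the summary at least this often

def summarize(turns: list[str]) -> str:
    # Placeholder: in practice this is a model call that compresses
    # older turns while preserving decisions and constraints.
    return f"[summary of {len(turns)} turns]"

class History:
    def __init__(self):
        self.turns: list[str] = []
        self.summary = ""
        self._last_refresh = 0

    def add(self, turn: str, decision_changed: bool = False):
        self.turns.append(turn)
        older = len(self.turns) - VERBATIM_TURNS
        stale = len(self.turns) - self._last_refresh >= REFRESH_EVERY
        # Refresh when the summary is stale or a key decision changed.
        if older > 0 and (stale or decision_changed):
            self.summary = summarize(self.turns[:older])
            self._last_refresh = len(self.turns)

    def window(self) -> tuple[str, list[str]]:
        """Return (summary of old turns, verbatim recent turns)."""
        return self.summary, self.turns[-VERBATIM_TURNS:]
```

The `decision_changed` flag is the important part: a calendar-based refresh alone still lets a stale summary survive right after a reversal.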
Retrieval is a precision problem, not a recall problem
Most RAG implementations err on the side of including too much. The logic goes: “better to include something irrelevant than to miss something important.” That sounds reasonable until you look at the actual failure modes.
From what I’ve seen, the most common production failure isn’t “the model didn’t have enough context.” It’s “the model had too much context and picked the wrong information.” Over-retrieval causes the model to confidently cite irrelevant passages while ignoring the one paragraph that actually answers the question.
Retrieve less. Filter aggressively. If you aren’t sure a chunk is relevant, leave it out. The model can ask follow-up questions. It can’t unsee irrelevant context.
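Concretely, that means filtering by a relevance threshold instead of always taking top-k. A sketch; the 0.75 cutoff and the `Chunk` shape are illustrative assumptions you’d tune against your own relevance labels:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    score: float  # similarity score from your retriever, higher is better

def filter_chunks(chunks: list[Chunk], threshold: float = 0.75,
                  max_chunks: int = 5) -> list[Chunk]:
    # Precision first: drop anything below the threshold,
    # then cap the total rather than padding to a fixed k.
    relevant = [c for c in chunks if c.score >= threshold]
    relevant.sort(key=lambda c: c.score, reverse=True)
    return relevant[:max_chunks]
```

Note what this does on a bad retrieval: if nothing clears the threshold, you send nothing, and the model says so, instead of confidently citing the least-wrong chunk.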
The real problem is that nobody measures this
Most teams have no idea what their context utilization looks like in production. They don’t track average context size, the ratio of relevant to irrelevant tokens, or the correlation between context size and output quality. They just set a max limit and hope for the best.
Instrument your context pipeline. Log the size of each section (system prompt, retrieved context, history, anchors). Track output quality as a function of context size. You’ll almost certainly discover that your sweet spot is much smaller than your current usage.
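The instrumentation itself is a few lines. A sketch, assuming the same section layout as above; the crude token counter and the logging sink are stand-ins for your real tokenizer and metrics system:

```python
import json
import logging

def count_tokens(text: str) -> int:
    # Crude ~4 chars/token approximation; use your real tokenizer.
    return max(1, len(text) // 4)

def log_context_stats(request_id: str, sections: dict[str, str]) -> dict:
    """Log per-section token counts so context size can be joined
    against output-quality metrics downstream."""
    stats = {name: count_tokens(text) for name, text in sections.items()}
    stats["total"] = sum(stats.values())
    logging.info("context_stats %s %s", request_id, json.dumps(stats))
    return stats
```

Once this is in place, plotting quality against `total` per request is a one-query job, and that plot is usually where the argument for a smaller budget wins itself.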
Bigger windows are a genuine improvement. They let you handle tasks that were impossible before. But for most production workloads, the discipline of managing context well matters more than the ability to stuff more into it.