Quick take
Your LLM is answering the same questions repeatedly and you’re paying for every single call. Exact-match caching alone can cut 30-50% of your API spend with zero quality loss. Add semantic caching carefully after that. The hard part isn’t the cache – it’s the key design and invalidation discipline.
I was reviewing a client's API logs last month and found something depressing. About 40% of their LLM requests were functionally identical. Same system prompt, same user question (give or take whitespace), same model. They were paying full price for every single one.
Caching is the most boring and most effective optimization you can make to an LLM application. It isn’t glamorous. It doesn’t involve new models or clever prompt tricks. It just saves money and makes things faster. Here is how I build it in Go.
Start with exact match caching
Don’t get fancy. The first layer is simple: hash the request, check the cache, return the cached response if it exists. This catches identical requests and costs almost nothing to implement.
type CacheKey struct {
	Version    string `json:"v"`
	Model      string `json:"model"`
	PromptHash string `json:"prompt_hash"`
	ToolsHash  string `json:"tools_hash"`
	ParamsHash string `json:"params_hash"`
}

func NewCacheKey(req LLMRequest) CacheKey {
	return CacheKey{
		Version:    "v1",
		Model:      req.Model,
		PromptHash: sha256Hash(req.SystemPrompt + "\n" + req.UserPrompt),
		ToolsHash:  sha256Hash(marshalTools(req.Tools)),
		ParamsHash: sha256Hash(fmt.Sprintf("%f:%d", req.Temperature, req.MaxTokens)),
	}
}

func (k CacheKey) String() string {
	b, _ := json.Marshal(k)
	return sha256Hash(string(b))
}

func sha256Hash(s string) string {
	h := sha256.Sum256([]byte(s))
	return hex.EncodeToString(h[:])
}
The key includes everything that can change the output: model, prompt content, tools, and sampling parameters. If any of those differ, you get a different key. If they are all the same, you get a cache hit.
Notice the version field. When you change your key schema – and you will – bump the version. This prevents old entries with a different key structure from colliding with new ones.
The cache layer itself
I keep the cache interface simple so the backing store can be swapped. In production I usually start with Redis. For testing and small deployments, an in-memory LRU works fine.
type LLMCache interface {
	Get(ctx context.Context, key string) (*CachedResponse, error)
	Set(ctx context.Context, key string, resp *CachedResponse, ttl time.Duration) error
	Delete(ctx context.Context, key string) error
}

type CachedResponse struct {
	Content   string    `json:"content"`
	Model     string    `json:"model"`
	TokensIn  int       `json:"tokens_in"`
	TokensOut int       `json:"tokens_out"`
	CachedAt  time.Time `json:"cached_at"`
}
func (s *Service) Generate(ctx context.Context, req LLMRequest) (*LLMResponse, error) {
	key := NewCacheKey(req).String()
	if cached, err := s.cache.Get(ctx, key); err == nil && cached != nil {
		s.metrics.CacheHit(req.Model)
		return &LLMResponse{
			Content:   cached.Content,
			Model:     cached.Model,
			FromCache: true,
		}, nil
	}
	s.metrics.CacheMiss(req.Model)

	resp, err := s.llmClient.Generate(ctx, req)
	if err != nil {
		return nil, err
	}

	cached := &CachedResponse{
		Content:   resp.Content,
		Model:     resp.Model,
		TokensIn:  resp.TokensIn,
		TokensOut: resp.TokensOut,
		CachedAt:  time.Now(),
	}

	// Fire and forget -- cache write failure should not block the response
	go func() {
		if setErr := s.cache.Set(context.Background(), key, cached, s.ttlFor(req)); setErr != nil {
			s.logger.Warn("cache set failed", "key", key, "error", setErr)
		}
	}()

	return resp, nil
}
A few things to note. The cache write is fire-and-forget. A failed cache write should never block or degrade the response to the user. The FromCache flag on the response is important for monitoring – you need to know what percentage of traffic is served from cache.
TTL strategy
This is where people get it wrong. They set a blanket TTL and call it done. Different content ages at different rates.
func (s *Service) ttlFor(req LLMRequest) time.Duration {
	// Responses grounded in static reference data can live longer
	if req.HasStaticContext() {
		return 24 * time.Hour
	}
	// Responses involving real-time data should be short-lived
	if req.HasLiveDataRetrieval() {
		return 5 * time.Minute
	}
	// Default: conservative TTL
	return 1 * time.Hour
}
Static context – like a system prompt explaining how to format output, or reference documentation that changes monthly – can tolerate a long TTL. Responses that depend on live data need short TTLs or no caching at all. When in doubt, err toward shorter TTLs. A cache miss costs money. A stale response costs trust.
Invalidation beyond TTLs
TTLs are your baseline. But you also need event-driven invalidation for cases where you know the cache is stale.
Prompt changes are the big one. Every time you update a system prompt or retrieval pipeline, the old cached responses are wrong. The versioned key handles this naturally – a new prompt produces a new hash, which produces a new key, which misses the cache. Old entries expire on their own TTL.
For data-driven invalidation, I use a simple pattern:
func (s *Service) OnKnowledgeBaseUpdate(ctx context.Context, docIDs []string) {
	// Invalidate any cached responses that used these documents
	for _, docID := range docIDs {
		keys, err := s.cacheIndex.KeysForDocument(ctx, docID)
		if err != nil {
			s.logger.Error("failed to lookup cache keys for document", "doc_id", docID, "error", err)
			continue
		}
		for _, key := range keys {
			_ = s.cache.Delete(ctx, key)
		}
	}
}
This requires maintaining a secondary index that maps documents to cache keys. It’s more work, but for applications where correctness matters – and it usually does – it’s worth it.
What NOT to cache
Not every response should be cached. I have a short list of exclusions:
- User-specific sensitive responses. Unless your cache has strict tenant isolation, don’t risk serving User A’s response to User B. I’ve seen this bug in production. It’s exactly as bad as it sounds.
- Responses that depend on time-sensitive external state. Stock prices, live inventory, anything where a one-hour-old answer is wrong.
- Creative or generative tasks where variability is the feature. If the user expects a different response each time, caching defeats the purpose.
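These exclusions are easy to encode as a guard evaluated before the cache lookup. A sketch, with illustrative field names (ContainsPII, HasLiveDataRetrieval) and an assumed temperature threshold rather than anything from a real API:

```go
package main

import "fmt"

// LLMRequest is trimmed to the fields the guard needs; the field names
// are illustrative, not from a real client library.
type LLMRequest struct {
	ContainsPII          bool    // user-specific sensitive content
	HasLiveDataRetrieval bool    // depends on time-sensitive external state
	Temperature          float64 // high temperature signals variability is wanted
}

// shouldCache applies the exclusion list: skip sensitive requests,
// live-data requests, and requests where variability is the feature.
// The 0.7 threshold is an assumption -- tune it for your workload.
func shouldCache(req LLMRequest) bool {
	if req.ContainsPII {
		return false
	}
	if req.HasLiveDataRetrieval {
		return false
	}
	if req.Temperature > 0.7 {
		return false
	}
	return true
}

func main() {
	fmt.Println(shouldCache(LLMRequest{Temperature: 0.2}))           // true: cacheable
	fmt.Println(shouldCache(LLMRequest{ContainsPII: true}))          // false: excluded
	fmt.Println(shouldCache(LLMRequest{HasLiveDataRetrieval: true})) // false: excluded
}
```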
Measuring what matters
You need four metrics from day one:
- Cache hit rate by request type. Not a global number. A 60% overall hit rate might mean 90% for classification and 10% for analysis. The per-type breakdown tells you where to focus.
- Latency with and without cache. This quantifies the speed improvement and justifies the infrastructure cost.
- Cost savings. Track tokens not consumed due to cache hits. Multiply by your per-token rate. Show this number to whoever pays the bills.
- Quality signals on cached responses. User corrections, retries, and thumbs-down ratings. If cached responses get worse quality signals than fresh ones, your TTL is too long or your keys are too broad.
Roll out behind a flag
Don’t flip caching on for all traffic at once. Use a feature flag. Start with one request type that has high repetition and low sensitivity. Measure hit rate, latency, and quality for a week. Then expand.
When something goes wrong – and something always goes wrong – you want to be able to turn caching off in seconds. A feature flag gives you that.
What matters
Caching isn’t sexy. It isn’t a new model or a clever prompting technique. It’s the same infrastructure discipline we’ve applied to every other expensive external service call for decades. The difference is that LLM calls are expensive enough that a 40% hit rate translates to real savings.
Build the cache. Version your keys. Keep TTLs honest. Monitor quality. The money you save on API calls will pay for a lot of actual engineering work.