Caching: The Easy Part Is Adding It, the Hard Part Is Everything Else

6 min read · Tags: caching, redis, performance, go

Cache-aside, write-through, invalidation strategies, and the failure modes that will wake you up at night. With Go examples.

Quick take

Cache-aside is the sane default. Invalidation is where everyone gets burned. Plan for stampedes and penetration on day one, not after the incident. And please – document what staleness window your product can tolerate before you write a single line of caching code.

Phil Karlton’s line about cache invalidation being one of the two hard things in computer science gets quoted constantly. It gets quoted because it’s true. I’ve added caching to systems at a large consumer platform, at a real-time messaging company, and at Decloud. Every time, the initial implementation was straightforward. Every time, the edge cases in invalidation and consistency were where we spent the real engineering effort.

Caching isn’t hard to implement. It’s hard to operate correctly.

What Is Worth Caching

Not everything. The filter is simple:

  • High read-to-write ratio. If you’re reading 100x more than writing, caching pays off immediately.
  • Expensive to compute or fetch. Database aggregations, external API calls, anything with meaningful latency.
  • Tolerant of staleness. A product listing that’s 30 seconds stale is fine. An account balance that’s 30 seconds stale isn’t.

If data changes constantly and correctness matters, don’t cache it. Add an index, optimize the query, or scale the database.

Core Patterns

Cache-Aside

The application checks the cache, falls back to the source on a miss, and populates the cache for next time. This is the default for good reason: it’s simple, it works with any data store, and the application controls the logic.

In Go, a basic cache-aside with singleflight to prevent stampedes:

import (
    "context"
    "database/sql"
    "encoding/json"
    "time"

    "github.com/redis/go-redis/v9"
    "golang.org/x/sync/singleflight"
)

type UserCache struct {
    redis  *redis.Client
    db     *sql.DB
    group  singleflight.Group
    ttl    time.Duration
}

func (c *UserCache) Get(ctx context.Context, id string) (*User, error) {
    // Check cache first
    data, err := c.redis.Get(ctx, "user:"+id).Bytes()
    if err == nil {
        var u User
        if err := json.Unmarshal(data, &u); err == nil {
            return &u, nil
        }
    }

    // Singleflight: only one goroutine fetches from DB for a given key
    val, err, _ := c.group.Do("user:"+id, func() (interface{}, error) {
        u, err := c.fetchFromDB(ctx, id)
        if err != nil {
            return nil, err
        }

        // Populate cache. Best-effort: if Marshal or Set fails, the next
        // read is just another miss, so the errors are deliberately ignored.
        data, _ := json.Marshal(u)
        c.redis.Set(ctx, "user:"+id, data, c.ttl)
        return u, nil
    })
    if err != nil {
        return nil, err
    }

    return val.(*User), nil
}

The singleflight.Group is the key detail here. Without it, a hot key expiry sends N concurrent requests to the database. With it, one goroutine fetches while the others wait and share the result. I’ve seen this single change reduce database load by 10x during peak traffic.

Write-Through

Writes update the cache and the primary store together. Reads are always fast and consistent. The tradeoff is higher write latency because you’re writing to two places.

func (c *UserCache) Update(ctx context.Context, u *User) error {
    if err := c.updateDB(ctx, u); err != nil {
        return err
    }

    data, err := json.Marshal(u)
    if err != nil {
        return err
    }

    return c.redis.Set(ctx, "user:"+u.ID, data, c.ttl).Err()
}

Good fit when writes are moderate and you can’t tolerate stale reads. One failure mode to plan for: if the Redis write fails after the database commit succeeds, the cache serves stale data until the TTL expires, so keep a TTL on write-through entries as a backstop. At a large consumer platform, we used write-through for restaurant availability – it changed infrequently but needed to be correct immediately.

Write-Behind

Writes go to the cache. A background process persists them asynchronously. This absorbs write spikes but introduces data loss risk if the cache crashes before flushing.

I’m cautious about write-behind. It’s appropriate for data you can rebuild – session state, analytics counters, things where losing a few seconds of writes is acceptable. For anything transactional, avoid it.

Invalidation

Three approaches, and most production systems combine two of them.

TTL-based expiration. The safety net. Every cached item has a time to live. When it expires, the next read triggers a refresh. Simple, reliable, and coarse. A 60-second TTL means your data can be up to 60 seconds stale. That’s the contract.

Event-based invalidation. When the source data changes, publish an event that invalidates or updates the cache. More precise than TTL, but it depends on reliable event delivery. If the invalidation event gets lost, the cache serves stale data until the TTL expires.

func (c *UserCache) HandleUserUpdated(ctx context.Context, event UserUpdatedEvent) error {
    return c.redis.Del(ctx, "user:"+event.UserID).Err()
}

Delete on invalidation rather than update. Let the next read repopulate. This avoids race conditions between the event handler and concurrent reads.

Versioned keys. Instead of user:123, use user:123:v7. When the data changes, increment the version. Old keys age out naturally. No explicit deletes, no race conditions. The cost is more keys in the cache and a version lookup on every read.
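The versioned-key read and write paths can be sketched against an in-memory map standing in for Redis. The key scheme mirrors user:123:v7 from above; the two lookups per read are the cost just mentioned, and in a real cache the old versioned keys would carry a TTL so they age out.

```go
package main

import (
	"strconv"
	"sync"
)

// versionedStore is an illustrative in-memory stand-in for Redis showing
// the versioned-key pattern: no deletes, no update races.
type versionedStore struct {
	mu   sync.Mutex
	data map[string]string
}

func newVersionedStore() *versionedStore {
	return &versionedStore{data: map[string]string{}}
}

// Get reads the current version pointer, then the versioned data key.
// Two round-trips per read is the price of never issuing deletes.
func (s *versionedStore) Get(id string) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	ver, ok := s.data["user:"+id+":ver"]
	if !ok {
		return "", false
	}
	v, ok := s.data["user:"+id+":v"+ver]
	return v, ok
}

// Set bumps the version and writes a fresh key. Readers holding the old
// version keep seeing a consistent old value; nobody sees a torn update.
// In a real cache, old versioned keys expire via TTL.
func (s *versionedStore) Set(id, value string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	ver, _ := strconv.Atoi(s.data["user:"+id+":ver"])
	ver++
	s.data["user:"+id+":ver"] = strconv.Itoa(ver)
	s.data["user:"+id+":v"+strconv.Itoa(ver)] = value
}
```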

My default: TTL as the safety net plus event-based invalidation for known write paths. This covers 90% of use cases.

Failure Modes

Stampedes

Hot key expires. Hundreds of concurrent requests miss the cache and hit the database. The database buckles.

Mitigations:

  • Singleflight (shown above) – one fetch per key at a time
  • Early refresh – refresh the cache before the TTL expires, while the current value is still valid
  • Stale-while-revalidate – serve the slightly stale value while refreshing in the background

Penetration

Requests for keys that will never exist in the source. Every request misses the cache and hits the database. Attackers love this.

Fix: cache negative results with a short TTL. If user:999 doesn’t exist, cache nil for 30 seconds. A Bloom filter in front of the cache can also help for large keyspaces.

Inconsistency

The cache says one thing. The database says another. This happens for all sorts of reasons: failed invalidation events, race conditions between writes and cache updates, network partitions between the application and Redis.

Accept eventual consistency where the product allows it. For the few paths that must be strongly consistent, bypass the cache and read from the source.

Multi-Level Caching

In-process memory for the hottest data. Redis for shared distributed caching. CDN for public static content. Each layer reduces latency but adds a place where stale data can hide.

Keep the hierarchy as shallow as your latency requirements allow. Two levels is common. Three is manageable. Four is a debugging nightmare.

Observability

You need four metrics at minimum:

  • Hit rate – below 80% and you should question whether the cache is helping
  • Latency – cache should be sub-millisecond; if it isn’t, check network or serialization
  • Eviction rate – high evictions mean the cache is too small or the TTLs are too long
  • Error rate – a cache error should fall back to the source, not fail the request
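The hit-rate bookkeeping needs nothing fancy. A plain-atomics sketch shows the shape; in production these counters would feed your metrics system, and eviction rate typically comes from the cache server itself (for Redis, the evicted_keys counter in INFO) rather than from application code.

```go
package main

import "sync/atomic"

// CacheMetrics tracks the minimum counters worth exporting for a cache.
type CacheMetrics struct {
	hits, misses, errors atomic.Int64
}

func (m *CacheMetrics) Hit()   { m.hits.Add(1) }
func (m *CacheMetrics) Miss()  { m.misses.Add(1) }
func (m *CacheMetrics) Error() { m.errors.Add(1) }

// HitRate returns hits / (hits + misses). Below roughly 0.8, question
// whether the cache is earning its operational cost.
func (m *CacheMetrics) HitRate() float64 {
	h, mi := m.hits.Load(), m.misses.Load()
	if h+mi == 0 {
		return 0
	}
	return float64(h) / float64(h+mi)
}
```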

Document the expected staleness window for each cached entity. “User profiles: up to 60 seconds stale” is a product decision, not a technical one. Make it explicit.

The Honest Summary

Caching is the fastest way to scale reads. It’s also the fastest way to introduce subtle bugs that only show up under load. Pick cache-aside as your default. Use singleflight. Set TTLs that match your product’s tolerance for staleness. Plan for stampedes and penetration before they happen.

And remember: the best cache is the one you can explain to the on-call engineer at 3 AM.