Distributed Systems Patterns I Keep Reaching For

6 min read
distributed-systems architecture patterns microservices

The patterns that actually survive production across failure handling, consistency, messaging, coordination, and scaling.

Quick take

Most distributed systems advice reads like a textbook. This is the shortlist I actually use. Timeouts, retries, circuit breakers, sagas, outbox/inbox, and backpressure – applied with discipline, not ceremony.

I’ve built and operated distributed systems at Verizon, AT&T, Decloud, and most recently at a large consumer platform. The failure modes are remarkably consistent across all of them. Network partitions, cascading timeouts, retry storms, stale caches. The systems that survive aren’t the clever ones. They’re the ones with boring, well-applied patterns.

This isn’t a catalog. It’s the set of patterns I keep reaching for, along with real tradeoffs and code where it helps.

Failure Handling

Timeouts and Deadlines

Every remote call without a timeout is a bug waiting to happen. I learned this the hard way at Decloud – a single downstream service that started responding in 30 seconds instead of 300 milliseconds brought down our entire checkout flow. No timeout, no deadline propagation, no circuit breaker. Just threads piling up until the JVM ran out of memory.

Propagate deadlines end to end. If the user’s request has 2 seconds left, the downstream call should know that. In Go, this is context.WithTimeout and it works beautifully:

func fetchUser(ctx context.Context, id string) (*User, error) {
    // Tighten the deadline: inherit the caller's budget, but never wait
    // longer than 500ms for this dependency.
    ctx, cancel := context.WithTimeout(ctx, 500*time.Millisecond)
    defer cancel()

    req, err := http.NewRequestWithContext(ctx, "GET", userServiceURL+"/"+id, nil)
    if err != nil {
        return nil, err
    }

    resp, err := httpClient.Do(req)
    if err != nil {
        return nil, fmt.Errorf("fetch user %s: %w", id, err)
    }
    defer resp.Body.Close()

    if resp.StatusCode != http.StatusOK {
        return nil, fmt.Errorf("fetch user %s: status %d", id, resp.StatusCode)
    }

    var u User
    if err := json.NewDecoder(resp.Body).Decode(&u); err != nil {
        return nil, fmt.Errorf("decode user %s: %w", id, err)
    }
    return &u, nil
}

The caller sets the budget. The callee respects it. Simple.

Retry With Backoff and Jitter

Retries fix transient errors. Immediate retries create storms. I’ve seen a single retry loop without jitter generate enough traffic to keep a recovering service down for an extra twenty minutes.

Exponential backoff with full jitter. Cap the attempts. Only retry on errors that are actually transient – a 400 isn’t transient, a 503 probably is.

func retryWithBackoff(ctx context.Context, maxAttempts int, fn func() error) error {
    var err error
    for attempt := 0; attempt < maxAttempts; attempt++ {
        if err = fn(); err == nil {
            return nil
        }

        if attempt == maxAttempts-1 {
            break // out of attempts; don't sleep before giving up
        }

        // Exponential backoff with full jitter: sleep a random duration in
        // [0, 100ms * 2^attempt) so retries from many clients spread out.
        backoff := time.Duration(1<<uint(attempt)) * 100 * time.Millisecond
        jitter := time.Duration(rand.Int63n(int64(backoff)))
        select {
        case <-ctx.Done():
            return ctx.Err() // respect the caller's deadline while waiting
        case <-time.After(jitter):
        }
    }
    return fmt.Errorf("after %d attempts: %w", maxAttempts, err)
}

Circuit Breaker

After a threshold of errors, stop calling the failing dependency. Just stop. Return a degraded response, a cached value, or an honest error. Let the dependency recover without your traffic making things worse.

The mental model: closed (normal), open (failing, short-circuit), half-open (testing recovery). The implementation doesn’t need to be complex. A counter, a timestamp, and a threshold.

Bulkheads

At a large consumer platform, we had a service that talked to six downstream dependencies through the same HTTP client pool. One slow dependency drained the pool and every other call started timing out too. Classic.

Separate connection pools per dependency. Separate goroutine budgets. The Titanic metaphor is overused but accurate – bulkheads keep one leak from sinking the whole ship.

Rate Limiting

A predictable 429 is better than an unpredictable timeout. Always. Apply rate limits at the edge and between services. I’ll talk more about this in a future post.
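A minimal token bucket is enough to get started. The rate and burst below are made up, and in a real service I'd reach for a maintained limiter library rather than this sketch:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// bucket is a minimal token bucket: refill at `rate` tokens/sec, cap at `burst`.
type bucket struct {
	mu     sync.Mutex
	tokens float64
	burst  float64
	rate   float64
	last   time.Time
}

func newBucket(rate, burst float64) *bucket {
	return &bucket{tokens: burst, burst: burst, rate: rate, last: time.Now()}
}

// Allow reports whether one request may proceed. Callers translate a
// false return into an HTTP 429 instead of queueing indefinitely.
func (b *bucket) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	now := time.Now()
	b.tokens += now.Sub(b.last).Seconds() * b.rate
	if b.tokens > b.burst {
		b.tokens = b.burst
	}
	b.last = now
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

func main() {
	b := newBucket(10, 3) // 10 req/s, burst of 3
	for i := 0; i < 5; i++ {
		fmt.Println(b.Allow())
	}
}
```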

Consistency and Transactions

Sagas

Distributed transactions across service boundaries don’t work. Full stop. Two-phase commit sounds great in a database textbook and falls apart the moment you have services owned by different teams with different SLAs.

Sagas replace a global transaction with local transactions and compensating actions. Two flavors:

Choreography: services react to events. Decentralized, but the flow gets hard to trace once you have more than four or five steps. Debugging a choreographed saga during an incident is an exercise in grep and prayer.

Orchestration: a coordinator drives the flow. Easier to understand, easier to audit, easier to debug. The coordinator becomes a critical path, so it needs to be durable. I default to orchestration unless the flow is truly simple.

Outbox and Inbox

You can’t publish an event and update a database atomically with two separate calls. The outbox pattern solves this: write the event to a table in the same transaction as your state change. A separate publisher reads the outbox and sends the events.

The flip side is duplicate delivery. The inbox pattern handles that – record processed event IDs at the consumer and skip repeats. Pair them together. This is the backbone of every reliable event pipeline I’ve built.
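A sketch of the outbox contract, modeled in memory for brevity: in production the order write and the outbox append happen in one database transaction, and the publisher marks rows sent only after the broker acknowledges them (this sketch drops that last guarantee):

```go
package main

import (
	"fmt"
	"sync"
)

type event struct {
	Type    string
	Payload string
}

// store models the guarantee the outbox gives you: the state change and
// the event are committed together or not at all.
type store struct {
	mu     sync.Mutex
	orders map[string]string
	outbox []event
}

// placeOrder is the "same transaction" half of the pattern: order state
// and the OrderPlaced event are written under one lock (one transaction).
func (s *store) placeOrder(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.orders[id] = "placed"
	s.outbox = append(s.outbox, event{"OrderPlaced", id})
}

// drainOutbox is the separate-publisher half: read pending events and
// hand them to the broker.
func (s *store) drainOutbox(publish func(event)) {
	s.mu.Lock()
	pending := s.outbox
	s.outbox = nil
	s.mu.Unlock()
	for _, e := range pending {
		publish(e)
	}
}

func main() {
	s := &store{orders: map[string]string{}}
	s.placeOrder("o-1")
	s.drainOutbox(func(e event) { fmt.Println("publish", e.Type, e.Payload) })
}
```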

Idempotency

Make writes idempotent. Use client-provided idempotency keys. Store the result keyed by that token. If the same request shows up twice, return the same result without doing the work again.

This isn’t optional in a system with retries. If you have retries (and you should), you need idempotency.
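The contract is small: look up the key, replay the stored result if present, otherwise do the work and store it. A durable store with a TTL replaces the in-memory map in production:

```go
package main

import (
	"fmt"
	"sync"
)

// idempotentHandler caches results by client-provided idempotency key.
// Holding the lock across work() also collapses concurrent duplicates
// into a single execution.
type idempotentHandler struct {
	mu   sync.Mutex
	seen map[string]string
}

func (h *idempotentHandler) Handle(key string, work func() string) string {
	h.mu.Lock()
	defer h.mu.Unlock()
	if result, ok := h.seen[key]; ok {
		return result // duplicate: replay the stored result, do no work
	}
	result := work()
	h.seen[key] = result
	return result
}

func main() {
	h := &idempotentHandler{seen: map[string]string{}}
	charges := 0
	charge := func() string { charges++; return "charged" }
	h.Handle("key-123", charge)
	h.Handle("key-123", charge) // retry with the same key: no second charge
	fmt.Println(charges)
}
```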

CQRS and Read Models

Strong consistency across services is expensive and usually unnecessary. Separate your write model from your read model. Accept eventual consistency for queries, search, and reporting. The read model can be denormalized, optimized, and rebuilt without touching the write path.

Messaging

Events

Events decouple services in time and deployment. Two styles worth knowing:

  • Event notification: “something happened, look it up if you care.” Lightweight, but consumers need access to the source.
  • Event-carried state: “something happened, here are the details.” Heavier payload, but consumers are self-sufficient.

Keep schemas versioned and backward compatible. Breaking an event schema in production is one of those things you only do once.

Work Queues

Queues buffer work and let you process at a steady rate. Multiple consumers on the same queue give you horizontal scaling for free. Set visibility timeouts, handle retries explicitly, and always have a dead-letter queue for messages that can’t be processed.

Coordination

Leader Election

Some work must be done by exactly one node. Scheduling, cleanup, deduplication. Use leader election with lease-based expiration. If the leader dies, the lease expires, and someone else takes over.

I’ve seen teams try to avoid leader election by distributing coordination across all nodes. It always ends with split-brain bugs that take weeks to reproduce.

Distributed Locks

Use sparingly. A distributed lock should protect a small, short-lived critical section. If you find yourself holding a lock for seconds, you probably need a different design – partition by key or use a single-writer pattern.

Scaling

Sharding and Consistent Hashing

Partition by key. When you add or remove nodes, consistent hashing minimizes the data that moves. Keep shard ownership explicit and routing predictable. Implicit sharding is debugging hell.
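A minimal hash ring with virtual nodes is most of what consistent hashing is in practice. FNV and 16 vnodes are arbitrary choices here:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// ring is a minimal consistent-hash ring. Virtual nodes (several points
// per physical node) spread keys evenly across nodes.
type ring struct {
	points []uint32
	owner  map[uint32]string
}

func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func newRing(nodes []string, vnodes int) *ring {
	r := &ring{owner: map[uint32]string{}}
	for _, n := range nodes {
		for v := 0; v < vnodes; v++ {
			p := hashKey(fmt.Sprintf("%s#%d", n, v))
			r.points = append(r.points, p)
			r.owner[p] = n
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// Lookup returns the node owning key: the first ring point clockwise
// from the key's hash.
func (r *ring) Lookup(key string) string {
	h := hashKey(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}

func main() {
	r := newRing([]string{"a", "b", "c"}, 16)
	fmt.Println(r.Lookup("user-42"))
}
```

Adding a node only claims the keys between its new points and their predecessors; everything else stays put, which is the whole reason to bother.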

Backpressure

When overloaded, shed work instead of collapsing. Bounded queues, rate limits, and explicit flow control. The system that says “no” gracefully is more reliable than the system that says “yes” and falls over.

This is a cultural thing too. Engineers need to be comfortable returning errors under load instead of trying to serve every request.
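In Go, a bounded queue plus a non-blocking send is the whole mechanism; the `select`/`default` is what turns "queue full" into an explicit "no":

```go
package main

import (
	"errors"
	"fmt"
)

var errOverloaded = errors.New("overloaded, try again later")

// submit sheds work when the bounded queue is full instead of blocking.
// The caller maps errOverloaded to an HTTP 429 or 503.
func submit(queue chan func(), job func()) error {
	select {
	case queue <- job:
		return nil
	default:
		return errOverloaded // say "no" gracefully
	}
}

func main() {
	queue := make(chan func(), 2) // the bound IS the backpressure policy
	accepted, shed := 0, 0
	for i := 0; i < 5; i++ {
		if err := submit(queue, func() {}); err != nil {
			shed++
		} else {
			accepted++
		}
	}
	fmt.Println(accepted, shed) // with no consumer running: 2 accepted, 3 shed
}
```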

Observability

Correlation IDs through every log and event. Latency percentiles (not averages). Error rates per dependency. Queue depth trends. Alert on symptoms – elevated latency, increasing error rates – not on individual errors.

Without observability, you’re guessing. During an incident, guessing is expensive.

The Default Kit

If I’m starting a new service, this is what goes in on day one:

  • Timeouts and deadline propagation on every remote call
  • Retries with exponential backoff and jitter for transient errors
  • Circuit breakers and bulkheads around dependencies
  • Idempotency keys for writes, inbox deduplication for events
  • Outbox pattern for reliable event publishing
  • Rate limiting and backpressure at boundaries
  • Correlation IDs and service-level metrics

None of this is novel. That’s the point. Distributed systems are messy by nature. Patterns make the mess predictable. Discipline over heroics.