When to Go Async (And When to Resist the Urge)


Async patterns solve real problems -- bursty traffic, slow dependencies, decoupled teams. But the complexity tax is real. Lessons from building event-driven systems at Decloud.

Quick take

Async is powerful. It’s also a complexity multiplier. Use task queues for background work, pub/sub for fan-out, event streams for replay. Make everything idempotent. If the user is staring at a spinner waiting for the result, just keep it synchronous.

At Decloud, we went through the classic async adoption arc. Started with a synchronous monolith. Hit scaling walls. Reached for message queues. Over-applied them. Spent six months debugging ordering issues and silent message drops before finding the right balance.

The lesson wasn’t “async is bad.” It was “async is a tool with a sharp edge, and you need to respect the edge.”

When Async Actually Helps

Synchronous request-response couples you to the latency and availability of every service in the chain. One slow dependency and the whole request stalls. Under load, it cascades.

Async breaks that chain. Good use cases:

Slow or variable work. Sending notifications, generating reports, processing uploads. At Decloud, we moved PDF generation off the request path and into a task queue. Response times dropped from seconds to milliseconds. Users got a “processing” state and a webhook when it was done.

Traffic spikes. A queue absorbs bursts that would overwhelm a synchronous system. The work gets done at the pace the workers can handle, not at the pace the traffic arrives.

Cross-team boundaries. When the order service and the notification service are owned by different teams with different deploy schedules, an event bus is the natural seam. Neither team needs to know the other’s implementation details.

When to resist: if the user is waiting for the result in the same HTTP request, keep it synchronous. Bolting async onto a path that needs a real-time answer adds latency, complexity, and a worse user experience.

The Three Shapes

Task Queues

One message, one consumer. Work gets distributed across a pool of workers. Good for background jobs, batch processing, anything where you need to smooth out load. RabbitMQ, SQS, Redis-based queues – the tooling is mature.
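To make the shape concrete, here is a minimal in-process sketch using Python's standard library. The names (`generate_pdf`, `task_queue`) are illustrative; a real deployment would use RabbitMQ, SQS, or a Redis-based queue, but the one-message-one-consumer semantics are the same.

```python
import queue
import threading

task_queue = queue.Queue()
results = []
results_lock = threading.Lock()

def generate_pdf(doc_id):
    # Stand-in for slow work moved off the request path.
    return f"pdf-{doc_id}"

def worker():
    while True:
        doc_id = task_queue.get()
        if doc_id is None:          # sentinel: shut this worker down
            task_queue.task_done()
            break
        result = generate_pdf(doc_id)
        with results_lock:
            results.append(result)
        task_queue.task_done()

# A small worker pool: each message goes to exactly one worker,
# so load is distributed, not broadcast.
workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers:
    w.start()

for doc_id in range(5):
    task_queue.put(doc_id)          # the request handler returns immediately

for _ in workers:
    task_queue.put(None)
for w in workers:
    w.join()
```

The request path only pays for the `put`; the slow work happens at whatever pace the pool can sustain.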

Pub/Sub

One event, many consumers. The billing service, the analytics service, and the notification service all react to “order placed” independently. Each consumer moves at its own pace. If one falls behind, the others aren’t affected.

We used this heavily at Decloud for cross-service communication. The critical insight: keep the event payload lean. Include enough data for consumers to act without calling back to the producer. That reduces coupling and eliminates a class of circular dependency bugs.
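The fan-out shape can be sketched in a few lines of in-memory Python. Real buses deliver to each subscriber asynchronously and independently; the topic name and payload fields here are made up for illustration, but note the lean payload: everything a consumer needs, no callback to the producer.

```python
from collections import defaultdict

# topic -> list of handlers; every subscriber sees every event
subscribers = defaultdict(list)

def subscribe(topic, handler):
    subscribers[topic].append(handler)

def publish(topic, event):
    for handler in subscribers[topic]:
        handler(event)

billing_seen, notify_seen = [], []
subscribe("order.placed", billing_seen.append)
subscribe("order.placed", notify_seen.append)

# Lean payload: enough for consumers to act without calling the producer back.
publish("order.placed", {"order_id": "o-123",
                         "total_cents": 4999,
                         "email": "a@example.com"})
```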

Event Streams

Kafka-style ordered logs. Consumers track their position, which makes replay and backfill possible. Ordering holds within a partition key – so design your keys carefully. We partitioned by tenant ID at Decloud, which gave us per-tenant ordering without global ordering overhead.
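Partition-key routing is simple to sketch. Kafka hashes the key bytes with murmur2; a stable stdlib checksum stands in here. The point is the invariant: every event for one tenant lands in the same partition, so per-tenant order holds without paying for global order.

```python
import zlib

NUM_PARTITIONS = 4
partitions = {p: [] for p in range(NUM_PARTITIONS)}

def partition_for(key: str) -> int:
    # Kafka uses murmur2 over the key bytes; crc32 is a stand-in
    # with the same property: same key, same partition, every time.
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

def append(tenant_id: str, event: dict):
    partitions[partition_for(tenant_id)].append(event)

# Interleaved producers: tenant-a and tenant-b events alternate on the wire,
# but each tenant's events stay ordered within its own partition.
for seq in range(3):
    append("tenant-a", {"tenant": "tenant-a", "seq": seq})
    append("tenant-b", {"tenant": "tenant-b", "seq": seq})
```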

Delivery and Idempotency

Three delivery semantics exist in theory:

  • At-most-once: fast, messages can be lost. Fine for metrics, bad for money.
  • At-least-once: reliable, but duplicates happen. This is the practical default.
  • Exactly-once: clean but expensive. Kafka supports it within its ecosystem. Across service boundaries, it’s a fantasy.

At-least-once means your consumers will see duplicates. Make them idempotent. Every event gets a stable ID. Every consumer records what it has processed. If it sees the same ID twice, it skips. This is non-negotiable.
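The dedupe logic is small. This sketch keeps the processed-ID set in memory; a real system would persist it (a DB table keyed by event ID, with the insert and the side effect in one transaction where possible). The event fields are illustrative.

```python
# Idempotent consumer: record processed event IDs, skip repeats.
processed = set()
side_effects = []

def handle(event):
    if event["id"] in processed:
        return              # duplicate delivery: at-least-once did its thing
    processed.add(event["id"])
    side_effects.append(f"email-sent:{event['id']}")

evt = {"id": "evt-42", "type": "password.reset"}
handle(evt)
handle(evt)   # redelivered by a retry; must be a no-op
```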

At Decloud, we skipped idempotency on a notification consumer because “how bad could duplicate emails be?” Bad. The answer is bad. A retry storm sent 47 copies of a password reset email to the same user. We fixed that quickly.

Ordering

Async systems are eventually consistent. If order matters, route related messages through the same partition key. Include a sequence number or version in the payload so consumers can detect and handle out-of-order delivery.
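One way to use that sequence number, sketched below: track the last sequence seen per entity and ignore anything at or below it, so late arrivals can't regress state. Dropping stale events is the simplest policy; buffering and reordering is the heavier alternative when every event must be applied.

```python
# Per-entity sequence tracking: stale or duplicate events are ignored.
last_seq = {}
applied = []

def apply(event):
    key, seq = event["key"], event["seq"]
    if seq <= last_seq.get(key, -1):
        return              # out of date: drop rather than regress
    last_seq[key] = seq
    applied.append(event)

apply({"key": "order-1", "seq": 0, "status": "placed"})
apply({"key": "order-1", "seq": 2, "status": "shipped"})
apply({"key": "order-1", "seq": 1, "status": "paid"})   # arrived late: dropped
```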

If you need strict global ordering across unrelated entities, async is probably the wrong tool. Use a database.

Reliability

The difference between a fragile async system and a resilient one:

  • Bounded retries with backoff. Not infinite retries. Not immediate retries. Exponential backoff with a cap and a dead-letter queue for messages that can’t be processed after N attempts.
  • Dead-letter queues. Every queue needs one. A message that fails repeatedly should land somewhere visible, not disappear. Check the DLQ daily. Automate alerts on depth.
  • Timeouts. A consumer that hangs on a message blocks the queue. Set processing timeouts and handle them.
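The first two bullets fit in one small function. A sketch, with toy delay values (production backoff would be seconds to minutes, with jitter): bounded attempts, exponential backoff with a cap, and a dead-letter list for messages that never succeed.

```python
import time

MAX_ATTEMPTS = 4
BASE_DELAY = 0.01   # seconds; toy value for the sketch
MAX_DELAY = 1.0     # backoff cap
dead_letters = []

def process_with_retry(message, handler):
    for attempt in range(MAX_ATTEMPTS):
        try:
            return handler(message)
        except Exception:
            if attempt == MAX_ATTEMPTS - 1:
                dead_letters.append(message)   # visible and alertable, not lost
                return None
            # Exponential backoff: BASE * 2^attempt, capped at MAX_DELAY.
            time.sleep(min(BASE_DELAY * 2 ** attempt, MAX_DELAY))

calls = []
def flaky(msg):
    calls.append(msg)
    raise RuntimeError("downstream unavailable")

process_with_retry({"id": "m-1"}, flaky)
```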

Observability

Async systems fail silently. That’s their worst property. You need:

  • Correlation IDs on every message so you can trace a flow across services
  • Queue depth and consumer lag metrics – if lag is growing, something is wrong
  • Per-consumer error rates and processing latency
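Consumer lag is the key number, and it's cheap to compute: newest offset the producers have written minus the offset the consumer has committed, summed over partitions. The offset values below are made up to show the arithmetic.

```python
# Latest offset written per partition (from the broker) vs. the offset
# this consumer group has committed. The gap is how far behind it is.
produced_offsets = {0: 120, 1: 98, 2: 150}
committed_offsets = {0: 120, 1: 90, 2: 130}

def consumer_lag(produced, committed):
    return sum(produced[p] - committed.get(p, 0) for p in produced)

lag = consumer_lag(produced_offsets, committed_offsets)
```

If this number is growing over time rather than hovering near zero, consumers aren't keeping up and something is wrong.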

If you can’t answer “how far behind is consumer X” in under a minute, your async system is a black box.

Traps I’ve Fallen Into

  • Unbounded queues that hid a downstream outage for hours. The backlog grew to millions. Recovery took longer than the outage.
  • Missing idempotency that turned retries into duplicate data. See: the 47 emails.
  • Synchronous waits on async results. If you’re polling a queue for a response to return to the user, you have reinvented synchronous calls with extra steps.
  • Ignoring ordering and getting nondeterministic bugs that only reproduce under load.

The Honest Tradeoff

Async architecture buys you decoupling, burst absorption, and independent scaling. It costs you debuggability, operational complexity, and eventual consistency headaches.

Use it where the benefit is clear. Resist it where synchronous is simpler. The teams that get this right aren’t the ones with the most Kafka clusters. They’re the ones that can explain exactly why each message flow exists and what happens when it breaks.