Message Queues: The Patterns Nobody Tells You About Until 3 AM

8 min read
Tags: messaging, architecture, rabbitmq, kafka

Queues look simple on a whiteboard. Then you deploy them. Here are the messaging patterns I've learned the hard way across three startups, with Go code and real failure stories.

Quick take

Every queue system you’ll ever use delivers at-least-once. Plan for duplicates or plan for 3 AM pages. The messaging model matters less than whether your consumers are idempotent.


At the fintech startup we had an event-driven pipeline that ingested financial news from hundreds of sources, scored it with NLP models, and fanned it out to user watchlists. Queues were the backbone. They were also the source of our worst incidents.

One afternoon someone bumped the ingestion worker concurrency thinking “more throughput, right?” Within twenty minutes, the scoring service was buried under 80,000 unprocessed jobs. Timeouts triggered retries. Retries doubled the queue depth. We were drowning in our own work.

That was the day I stopped treating message queues as plumbing and started treating them as architecture. Everything below is what I wish I had known before that afternoon.

When to reach for a queue (and when not to)

Queues aren’t a default. They are a tradeoff. You get decoupling and resilience. You pay with operational complexity and eventual consistency.

Use a queue when:

  • The work is slow or calls flaky external services
  • The user doesn’t need an immediate answer
  • You need to buffer bursty traffic (the fintech startup’s market-open spike was 50x baseline)
  • Multiple services need the same event with different processing logic

Keep it synchronous when:

  • The user is staring at a spinner waiting for your response
  • The operation is fast and reliable enough that a direct call is simpler
  • It must be atomic with the request – half-done is worse than not done

If your queue exists because someone said “we should decouple this” but nobody can explain what failure mode it prevents, you added complexity for nothing. I’ve made this mistake. It isn’t worth the Kubernetes YAML.

The three models that cover 90% of use cases

Work queue: one job, one worker

The simplest pattern. A producer drops a task on a queue. One of N workers picks it up and processes it.

// Producer
func EnqueueScoringJob(ch *amqp.Channel, articleID string) error {
    body, _ := json.Marshal(ScoringJob{ArticleID: articleID})
    return ch.Publish("", "scoring_jobs", false, false, amqp.Publishing{
        DeliveryMode: amqp.Persistent,
        ContentType:  "application/json",
        Body:         body,
    })
}

// Worker
func handleScoringJob(d amqp.Delivery) {
    var job ScoringJob
    if err := json.Unmarshal(d.Body, &job); err != nil {
        d.Nack(false, false) // bad payload, don't requeue
        return
    }
    if _, err := scoreArticle(job.ArticleID); err != nil {
        d.Nack(false, true) // transient failure, requeue
        return
    }
    d.Ack(false)
}

This was the workhorse at the fintech startup. Articles land, get saved, scoring job goes on the queue. Worker scores it, saves the result. Clean separation between ingestion and processing.

Fan-out: one event, many consumers

One event triggers work in multiple services that don’t know about each other. At the fintech startup, a scored article needed to:

  • Match against user watchlists
  • Update topic aggregates
  • Push notifications
  • Feed the analytics pipeline

Each consumer gets its own queue bound to a shared exchange. If watchlist matching fails, notifications still go out. Isolation is the point.

// Publish to a fanout exchange
func PublishArticleScored(ch *amqp.Channel, article Article) error {
    body, _ := json.Marshal(ArticleScoredEvent{
        ArticleID: article.ID,
        Scores:    article.Scores,
        Timestamp: time.Now().UTC(),
    })
    return ch.Publish("article_events", "", false, false, amqp.Publishing{
        DeliveryMode: amqp.Persistent,
        ContentType:  "application/json",
        Body:         body,
    })
}

Each downstream service declares its own queue and binds it to article_events. The publisher has no idea how many consumers exist. That’s the whole value.

Topic routing: selective subscription

Messages carry a routing key. Consumers subscribe to patterns. The publisher stays simple while consumers pick only the events they care about.

orders.created    -> Order Service, Analytics
orders.paid       -> Billing, Analytics
orders.*          -> Audit Log

I used this pattern at Dropbyke for fleet events. The routing key was {city}.{event_type}: seoul.ride_started, london.bike_locked. The operations dashboard subscribed to *.bike_locked. The billing service only cared about *.ride_completed. One exchange, many consumption patterns.
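If you've never internalized the wildcard rules, a toy matcher makes them concrete. This is a mental model of RabbitMQ-style topic matching, not the broker's actual implementation: * matches exactly one dot-separated word, # matches zero or more.

```go
import "strings"

// matchTopic reports whether a routing key matches a binding pattern
// using RabbitMQ-style topic semantics: "*" matches exactly one
// dot-separated word, "#" matches zero or more words.
// Illustrative mental model, not the broker's implementation.
func matchTopic(pattern, key string) bool {
	return match(strings.Split(pattern, "."), strings.Split(key, "."))
}

func match(pat, key []string) bool {
	if len(pat) == 0 {
		return len(key) == 0
	}
	switch pat[0] {
	case "#":
		// "#" can absorb zero words, or one word and stay in place.
		if match(pat[1:], key) {
			return true
		}
		return len(key) > 0 && match(pat, key[1:])
	case "*":
		return len(key) > 0 && match(pat[1:], key[1:])
	default:
		return len(key) > 0 && pat[0] == key[0] && match(pat[1:], key[1:])
	}
}
```

With this model, matchTopic("*.bike_locked", "seoul.bike_locked") is true while matchTopic("orders.*", "orders") is false, because * must consume exactly one word.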

Delivery semantics: the part everyone gets wrong

At-least-once is your reality

Every queue system I’ve used in production delivers at-least-once. Not exactly-once. Exactly-once is a distributed systems unicorn that vendors sell and engineers discover is a lie during an incident.

The implication: your consumers must be idempotent. Running the same message twice must produce the same result, not duplicate side effects.

func handleScoringJob(d amqp.Delivery) {
    var job ScoringJob
    if err := json.Unmarshal(d.Body, &job); err != nil {
        d.Nack(false, false) // bad payload, don't requeue
        return
    }

    // Idempotency check: skip if already processed
    if alreadyProcessed(job.ID) {
        d.Ack(false)
        return
    }

    scores, err := scoreArticle(job.ArticleID)
    if err != nil {
        d.Nack(false, true) // transient failure, requeue
        return
    }
    saveScoresAndMarkDone(job.ArticleID, scores, job.ID) // atomic
    d.Ack(false)
}

At the fintech startup we scored the same article twice early on and pushed duplicate notifications to users. Not a good look for a financial data platform. The idempotency check is a single database lookup. The duplicate notification destroys user trust. Easy math.
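Here's the shape of that check with an in-memory store standing in for the database table. ProcessedStore and the job IDs are illustrative names; in production the same role is played by an INSERT with a unique constraint inside the transaction that writes the scores.

```go
import "sync"

// ProcessedStore stands in for the database table that records which
// job IDs have been handled. CheckAndMark is atomic: it returns true
// only the first time a given ID is seen, so duplicate deliveries
// are acked without re-running side effects. (Illustrative sketch.)
type ProcessedStore struct {
	mu   sync.Mutex
	seen map[string]bool
}

func NewProcessedStore() *ProcessedStore {
	return &ProcessedStore{seen: make(map[string]bool)}
}

func (s *ProcessedStore) CheckAndMark(jobID string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.seen[jobID] {
		return false // duplicate delivery: skip the work, still ack
	}
	s.seen[jobID] = true
	return true
}
```

Calling CheckAndMark("job-1") twice returns true then false, which is exactly the behavior that would have saved us from the duplicate notifications.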

Ack after work, not before

Acking before work completes gives you at-most-once delivery. The job is gone from the queue. If the worker dies mid-processing, so does the job. Silently. I’ve debugged this exact issue and it’s miserable because nothing looks wrong – the queue is empty, the worker is healthy, but work is missing.

Ack after processing completes. Accept the occasional duplicate. Build idempotent consumers. This is the boring answer. It’s also the correct one.
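A thin wrapper makes the ordering hard to get wrong. This is a sketch, not part of the amqp API: the ack and requeue parameters stand in for d.Ack and d.Nack, and handleWithAck is an illustrative name.

```go
import "errors"

// errTransient marks a failure worth retrying (illustrative).
var errTransient = errors.New("transient failure")

// handleWithAck runs the handler first and acknowledges only on
// success, so a crash mid-processing leaves the message on the queue
// for redelivery. ack and requeue are function values standing in
// for d.Ack / d.Nack.
func handleWithAck(body []byte, handler func([]byte) error, ack, requeue func()) error {
	if err := handler(body); err != nil {
		requeue() // work failed: leave it for another worker
		return err
	}
	ack() // work done: now it's safe to remove from the queue
	return nil
}
```

The structure guarantees there is no code path where the message leaves the queue before the work is finished.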

Dead letter queues aren’t optional

After N retries, stop. Move the message to a dead letter queue. The main queue stays healthy. You get a pile of failures to inspect on your own schedule, not a poisoned queue that blocks everything behind the bad message.

// Declare the work queue with a dead letter exchange attached
args := amqp.Table{
    "x-dead-letter-exchange":    "dlx",
    "x-dead-letter-routing-key": "scoring_jobs.dead",
    "x-message-ttl":             int32(300000), // 5 min TTL
}
ch.QueueDeclare("scoring_jobs", true, false, false, false, args)

Without a DLQ, a single malformed message can block your entire pipeline. I learned this during a Google for Startups event in Seoul in 2018 when I was demoing Dropbyke’s system and a corrupted GPS payload brought the whole queue to a halt. In front of judges. Fun times.

Retries: the thundering herd problem

Exponential backoff is table stakes. But add jitter.

Without jitter, all your failed jobs retry at the same instant. You get a thundering herd that takes down the very service you were trying to be gentle with.

func retryDelay(attempt int) time.Duration {
    base := time.Duration(1<<uint(attempt)) * time.Second
    jitter := time.Duration(rand.Int63n(int64(base / 2)))
    return base + jitter
}
// attempt 0: ~1-1.5s
// attempt 1: ~2-3s
// attempt 2: ~4-6s
// attempt 3: ~8-12s

We had this exact problem at the fintech startup with an upstream news API. All workers backed off to the same retry interval, all hit the API simultaneously, all got rate limited again. Jitter broke the cycle immediately.

Cap your retries. Three to five attempts is usually enough. If it hasn’t worked by then, it isn’t a transient failure – it’s a bug. Send it to the dead letter queue.

Ordering is a tradeoff, not a feature

Strict ordering kills throughput. If you need messages processed in order, you’re limited to a single consumer per partition (Kafka) or a single consumer per queue (RabbitMQ). That’s fine for some use cases. It’s a bottleneck for most.

When I can, I design around it. Timestamps on events. Last-write-wins semantics. Idempotent updates that converge regardless of order. This lets me scale out consumers aggressively without worrying about sequencing.

When ordering genuinely matters – financial transactions, state machines – limit concurrency and accept the throughput cost. Just make sure you’re choosing ordering because you need it, not because you assumed you did.

Operations: what to watch

You need visibility into your queues from day one. Not “when it gets serious.” Day one.

  • Queue depth – if it grows faster than workers drain it, you’re losing
  • Age of oldest message – tells you more about latency than any average
  • Retry rate per job type – high retries on one job type means a bug, not a load problem
  • Dead letter queue size – should be near zero. If it isn’t, something is systematically broken
  • Consumer count – if it drops and queue depth rises, a deploy probably killed your workers

Alert on queue depth growth and dead letter queue crossing a threshold. Everything else is a dashboard you check during coffee.
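The depth-growth alert doesn't need to be fancy. A sketch, assuming you poll depth from your broker's management API on a fixed interval; DepthGrowing and the window size are illustrative, not a specific broker feature.

```go
// DepthGrowing reports whether queue depth increased across every one
// of the last `window` consecutive samples: a cheap proxy for
// "producers are outrunning consumers". samples are queue depths in
// poll order. (Illustrative alert logic, not a broker API.)
func DepthGrowing(samples []int, window int) bool {
	if len(samples) < window+1 {
		return false // not enough data to judge a trend
	}
	recent := samples[len(samples)-window-1:]
	for i := 1; i < len(recent); i++ {
		if recent[i] <= recent[i-1] {
			return false // depth held steady or drained at least once
		}
	}
	return true
}
```

Requiring sustained growth rather than a single spike keeps the alert quiet during normal bursts while still catching the 80,000-job death spirals.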

Picking a broker

I’ve used most of these across the fintech startup, Dropbyke, and now Decloud:

  • RabbitMQ: Strong routing, proper acknowledgments, works well for complex topologies. Our primary at the fintech startup.
  • Kafka: Not really a job queue. It’s a distributed log. The replay capability is the killer feature. Use it when you need event sourcing or high-throughput streams.
  • Redis-backed (Machinery in Go, Sidekiq, Celery): Fast, simple, good enough for non-critical paths. Watch out for durability – Redis persistence has tradeoffs.
  • SQS: Less control, less ops. Fine if you live inside AWS and can tolerate its semantics.

Pick the simplest thing that meets your durability and throughput requirements. You can always migrate later. You probably won’t, so pick carefully.

What I keep getting wrong

After running queues in production across three companies, these are the mistakes I still catch myself making:

Underestimating queue depth during spikes. Market open at the fintech startup was 50x normal volume. “Should be fine” isn’t capacity planning.

Ignoring the operational model until after launch. Deploys, rollbacks, debugging stuck consumers – figure this out before you ship, not during your first incident.

Building abstractions that hide the queue’s behavior. Wrapping RabbitMQ in six layers of indirection feels clean until something breaks and you can’t tell whether the problem is in your code or the broker. Keep the abstraction thin.

Queues look simple on a whiteboard. Producer, arrow, box, arrow, consumer. But production finds every gap in your design. The patterns above aren’t clever. They are boring. Idempotency, backoff, dead letters, fan-out. They are also the difference between a system that handles failure gracefully and one that pages you because 200,000 duplicate notifications just went out.

Build the boring stuff first.