Async Job Processing: Patterns That Saved Us at a Fintech Startup

7 min read · Tags: backend, architecture, async, queues

Hard-won patterns for reliable background job processing – queues, retries, idempotency, and the failures that taught me to care about all three.

A Thursday afternoon. We pushed a deploy that bumped our news ingestion worker concurrency from 4 to 12. More throughput, right? Within twenty minutes, our NLP scoring pipeline was backed up with 80,000 unprocessed jobs. The scoring service couldn’t keep up, started timing out, and those timeouts triggered retries. The retries doubled the queue. We were drowning in our own work.

That incident taught me more about async job processing than any blog post or conference talk. I’m going to share the patterns we settled on at the fintech startup – patterns born from real failures, not whiteboards.

Not Everything Belongs in the Background

At the fintech startup, we process financial news from hundreds of sources. Articles come in, get parsed, scored by NLP models, tagged, and routed to user watchlists. None of that should block an API response.

But not all work should be async either. I’ve seen teams shove everything into a queue and then wonder why their system is impossible to debug.

Push to a queue when:

  • The work is slow or calls external services that flake
  • The user doesn’t need an immediate answer
  • You want retries without blocking the request
  • Timing is flexible

Keep it synchronous when:

  • The user needs confirmation right now
  • The operation is fast and reliable
  • It must be atomic with the request – half-done is worse than not done

Simple rule: if the user is staring at a spinner waiting for your queue to process, you made the wrong call.

The Patterns

Basic Queue and Worker

Nothing fancy. A durable queue sits between the thing that creates work and the thing that does it.

def ingest_article(article):
    db.save(article)
    queue.enqueue("score_article", article_id=article.id)
    return article

@worker.task("score_article")
def score_article(article_id):
    article = db.get_article(article_id)
    scores = nlp.score(article.content)
    db.update_scores(article_id, scores)

We used this exact shape at the fintech startup for the first stage of our pipeline. Article lands, gets saved, job goes on the queue. Worker picks it up, scores it, saves the result. Clean separation.

Fan-Out

One event, many consequences. When a new article is scored, we need to:

  • Match it against user watchlists
  • Update topic aggregates
  • Push notifications to relevant users
  • Log it for analytics

Each of those is a separate job. If watchlist matching fails, notifications still go out. Isolation is the point. Speed is a side effect.
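The fan-out step can be sketched as a loop that enqueues one job per consequence. The queue interface here (a job name plus keyword arguments, as in the enqueue call earlier) and the job names are illustrative assumptions, not a specific library's API:

```python
class InMemoryQueue:
    """Stand-in for a durable queue, for illustration only."""
    def __init__(self):
        self.jobs = []

    def enqueue(self, job_name, **kwargs):
        self.jobs.append((job_name, kwargs))

def fan_out_scored_article(queue, article_id):
    # One event, four independent jobs. If one of them fails and
    # retries, the other three are unaffected -- that's the isolation.
    for job_name in (
        "match_watchlists",
        "update_topic_aggregates",
        "push_notifications",
        "log_analytics",
    ):
        queue.enqueue(job_name, article_id=article_id)
```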

Pipelines

Our ingestion system is a pipeline. Raw article comes in. First job parses and normalizes it. Second job runs NLP scoring. Third job matches it to watchlists and triggers notifications. Each stage has its own retry logic, its own failure mode, its own queue.

This matters because NLP scoring is slow and sometimes the model service is down. We don’t want a flaky model to block article parsing for everything behind it in the queue.
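One minimal way to wire stages together – a sketch, with hypothetical stage names – is to have each stage enqueue its successor when it finishes, so a stalled stage only backs up its own queue:

```python
# Hypothetical stage order; each stage has its own queue and retries.
PIPELINE = ["parse_article", "score_article", "match_and_notify"]

def next_stage(current):
    """Return the stage after `current`, or None if it's the last one."""
    i = PIPELINE.index(current)
    return PIPELINE[i + 1] if i + 1 < len(PIPELINE) else None

def run_stage(queue, stage, article_id):
    # ... the stage's real work (parsing, scoring, matching) goes here ...
    follow_up = next_stage(stage)
    if follow_up is not None:
        queue.enqueue(follow_up, article_id=article_id)
```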

Priority Queues

Not all work is equal. A user requesting a portfolio refresh needs faster turnaround than a batch reprocessing job running at 3 AM.

queues:
  critical:
    concurrency: 8
  default:
    concurrency: 4
  bulk:
    concurrency: 1

We learned this the hard way. Our bulk reprocessing jobs were starving real-time scoring because they shared a queue. Separate queues, separate concurrency limits. Problem solved overnight.

Reliability: The Hard Part

At-Least-Once Is Your Reality

Every queue system I’ve used in production delivers at-least-once. Not exactly-once. Exactly-once is a distributed systems myth that vendors sell and engineers discover is a lie at 2 AM.

So your workers must be idempotent. Running the same job twice should produce the same result, not duplicate side effects.

def score_article(job_id, article_id):
    if db.job_completed(job_id):
        return  # already done, skip
    scores = nlp.score(db.get_article(article_id).content)
    # save and mark completed in one transaction, so a crash between
    # the two can't leave a job half-finished
    db.save_scores_and_mark_done(article_id, scores, job_id)

At the fintech startup, we scored the same article twice early on and pushed duplicate notifications to users. Not a good look for a financial data platform. The idempotency check is cheap. The duplicate notification isn’t.

Retries With Backoff and Jitter

Retry on transient errors. Don’t retry on bad input – that job will fail forever and clog your queue.

Exponential backoff is table stakes. But add jitter. Without it, all your failed jobs retry at the same instant and you get a thundering herd that takes down the very service you were trying to be gentle with.

We had this exact problem with an upstream news API. Rate limited, all workers backed off to the same retry interval, all hit the API simultaneously, all got rate limited again. Jitter broke the cycle.
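The backoff-plus-jitter calculation is a few lines. This is the "full jitter" variant – sleep a random amount between zero and the exponential ceiling – with illustrative base and cap values:

```python
import random

def backoff_with_jitter(attempt, base=1.0, cap=60.0):
    """Delay in seconds before retry number `attempt` (0-indexed).

    Full jitter: pick uniformly between 0 and the exponential
    ceiling, so failed jobs spread out instead of retrying in
    lockstep and hammering the recovering service together.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)
```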

Timeouts and Heartbeats

Jobs need time limits. Our NLP scoring sometimes hung when the model service was degraded – not down, just slow enough to hold a connection open forever. A 30-second timeout per job and a heartbeat mechanism let us detect stuck workers and reassign their jobs.

Without this, you get ghost workers. They look alive. They hold a job. They do nothing.
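Detecting those ghost workers can be as simple as comparing last-heartbeat timestamps against a threshold. A sketch, assuming workers write a timestamp per job on each heartbeat:

```python
import time

HEARTBEAT_TIMEOUT = 30.0  # seconds; illustrative, match your job timeout

def find_stuck_jobs(heartbeats, now=None):
    """heartbeats maps job_id -> last heartbeat timestamp (epoch seconds).

    Any job silent for longer than the timeout is presumed stuck;
    its worker gets killed and the job is reassigned.
    """
    now = time.time() if now is None else now
    return [job_id for job_id, ts in heartbeats.items()
            if now - ts > HEARTBEAT_TIMEOUT]
```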

Dead Letter Queues

After N retries, stop. Move the job to a dead letter queue. Inspect it later.

The main queue stays healthy. You get a neat pile of failures to investigate when you have time, not a poisoned queue that blocks everything behind the bad job.
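The retry-then-park logic looks roughly like this – a sketch with an illustrative attempt cap and plain lists standing in for real queues:

```python
MAX_ATTEMPTS = 5  # illustrative cap, tune per job type

def handle_failure(job, main_queue, dead_letters):
    """Requeue a failed job until it exhausts its attempts, then
    park it on the dead letter queue for later inspection."""
    job["attempts"] += 1
    if job["attempts"] >= MAX_ATTEMPTS:
        dead_letters.append(job)   # out of the hot path, kept for debugging
    else:
        main_queue.append(job)     # back in line for another try
```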

The Transactional Outbox

This one bit us. We’d save an article to the database and enqueue a scoring job. Sometimes the enqueue failed after the database write succeeded. Article saved, never scored. Silent data loss.

The fix is an outbox table:

BEGIN;
INSERT INTO articles (...) VALUES (...);
INSERT INTO outbox (event_type, payload) VALUES ('article_created', ...);
COMMIT;

A poller picks up outbox rows and publishes them to the queue. Both writes are in the same transaction. Either both happen or neither does.
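One pass of that poller can be sketched like this, with the database and queue abstracted as callables since the actual client APIs vary. Note the ordering: publish first, then mark – a crash in between means a re-publish, which is exactly why consumers must be idempotent:

```python
def drain_outbox(fetch_unpublished, publish, mark_published):
    """One poller pass over the outbox table.

    fetch_unpublished() -> iterable of outbox rows (dicts)
    publish(event_type, payload) -> push onto the queue
    mark_published(row_id) -> flag the row so it isn't re-sent
    """
    for row in fetch_unpublished():
        publish(row["event_type"], row["payload"])
        # If we crash here, the row stays unpublished and gets
        # re-published next pass: at-least-once, by design.
        mark_published(row["id"])
```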

It’s ugly. It works. I’ll take ugly-and-correct over elegant-and-lossy every time.

Job Design Principles

Keep jobs small. A job that does five things is five chances to fail and one giant retry. Break it up.

Pass IDs, not objects. Serialize an article ID, not the entire article. The worker loads fresh data. No stale payloads, smaller queue messages, and if the article was updated between enqueue and processing, you get the latest version.

Log everything useful. Job ID, attempt number, processing time, error category. When something breaks at 3 AM, grep-friendly logs are the difference between a 10-minute fix and a 2-hour investigation.

Operations

You need visibility into your queues from day one. Not “when it gets serious.” Day one.

Watch these:

  • Queue depth – if it’s growing faster than workers drain it, you have a problem
  • Age of the oldest message – this tells you about latency better than averages
  • Retry and failure rates per job type
  • Dead letter queue size
  • Worker count and utilization

Alert on queue depth growing, error rate spiking, and dead letter queue crossing a threshold. Everything else is a dashboard.
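The alerting rule itself is a small predicate over those metrics. The thresholds below are placeholders, not recommendations – tune them to your traffic:

```python
def should_page(queue_depth, depth_slope, error_rate, dlq_size,
                depth_limit=50_000, error_limit=0.05, dlq_limit=100):
    """Page only on the three conditions worth waking someone for:
    a growing queue that's already deep, a spiking error rate, or a
    dead letter queue past its threshold. Everything else is a
    dashboard, not a page."""
    return (depth_slope > 0 and queue_depth > depth_limit) \
        or error_rate > error_limit \
        or dlq_size > dlq_limit
```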

Picking a Queue (Late 2018 Edition)

I’ve used most of these by now:

  • Redis-backed (Sidekiq, Celery with Redis): Fast, simple, good enough for most web apps. We used this at the fintech startup for non-critical paths.
  • RabbitMQ: Better routing, proper acknowledgments, more operational overhead. Good when you need complex topologies.
  • Kafka: Not really a job queue, but we used it for our high-throughput news ingestion stream. Replay capability is the killer feature.
  • Managed queues (SQS, Cloud Tasks): Less control, less ops. Fine if you can live with the vendor’s semantics.

Pick the simplest thing that meets your durability and throughput requirements. You can always migrate later. You probably won’t, so pick carefully.

What I Wish I Knew Earlier

Async job processing looks simple on a whiteboard. Queue in, worker out. But production has a way of finding every gap in your design. Jobs run twice. Workers die mid-processing. Queues back up because a downstream service is having a bad day.

The patterns here aren’t clever. Idempotency, backoff, dead letter queues, transactional outboxes – they’re boring. They’re also the difference between a system that handles failure gracefully and one that pages you at 3 AM because 200,000 duplicate notifications just went out to your users.

Build the boring stuff first. Your on-call self will thank you.