What Building Distributed Systems at a Fintech Startup Taught Me About Failure

| 6 min read |
distributed-systems reliability architecture resilience

Hard-won lessons from designing distributed systems that survive real-world failures – timeouts, retries, bulkheads, and the operational habits that actually keep things running.

A few months into my time as CTO at the fintech startup, our financial news aggregation pipeline went silent. No errors. No alerts. Just… nothing. Users saw stale headlines. The dashboard looked green. Took us twenty minutes to figure out what happened: a single downstream API had started responding with 200 OK and empty bodies. Every service downstream trusted that response, cached the emptiness, and served it proudly. We had built a system that could tolerate crashes, but not a liar.

That incident rewired how I think about distributed systems. The spectacular failures – network partitions, node crashes, disk corruption – those are almost easy. You plan for them because they’re dramatic. The killers are the subtle ones. The slow dependency that doesn’t quite time out. The clock drift that makes your event ordering nonsensical. The duplicate message that creates two charges on someone’s credit card.

Partial Failure Is the Default State

At the fintech startup we pulled financial data from dozens of sources, ran NLP pipelines, scored relevance, and served personalized feeds. At any given moment, something in that chain was degraded. Not down. Degraded. A source returning stale data. A scoring service running hot. A queue backing up.

Once I accepted that partial failure is the normal operating condition, everything clicked. You don’t design for “what if something breaks.” You design for “something is always broken, how do we still deliver value?”

That means setting explicit targets. Not vague “five nines” aspirations, but concrete numbers: P99 latency under 400ms for feed requests, error rate below 0.1% for content delivery. Then you design every component to hold those targets even when a dependency is misbehaving.

The Patterns That Actually Saved Us

Timeouts. Everywhere. No Exceptions.

I can’t overstate this. Every network call needs a timeout. Not just HTTP requests – database queries, cache lookups, message publishes. All of them.

import requests

response = requests.get(url, timeout=2)  # give up after 2 seconds rather than hang

Two seconds. That’s it. If your financial data provider can’t answer in two seconds, we move on. We had a service early on with no timeout on a third-party API call. That provider had a bad day, responses went from 200ms to 45 seconds, and our entire request pipeline backed up behind it. Thread pools exhausted. Cascading failure across three services. All because of one missing timeout.

Even better: use end-to-end deadlines. If the user’s request has a 3-second budget, propagate that deadline downstream. Don’t let an inner service burn 2.8 seconds and leave nothing for the rest of the chain.
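A minimal sketch of deadline propagation, assuming a shared monotonic-clock deadline passed down the call chain (the function names are mine, not a real framework API):

```python
import time
import urllib.request

def remaining_budget(deadline):
    """Seconds left before the end-to-end deadline, clamped at zero."""
    return max(0.0, deadline - time.monotonic())

def get_with_deadline(url, deadline):
    remaining = remaining_budget(deadline)
    if remaining == 0:
        raise TimeoutError("deadline exceeded before the call was made")
    # Cap this call's timeout at whatever budget remains.
    return urllib.request.urlopen(url, timeout=remaining)

# A 3-second budget shared by every downstream call in the chain:
deadline = time.monotonic() + 3.0
```

Each hop computes its timeout from the same deadline, so no single service can silently consume the whole budget.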

Retries That Don’t Make Things Worse

Retries are a double-edged sword. Done right, they mask transient blips. Done wrong, they turn a struggling service into a dead one.

import random, time

def call_with_retries(call, max_retries=3):
    for attempt in range(max_retries):
        try:
            return call()
        except RetryableError:
            if attempt == max_retries - 1:
                raise  # retry budget exhausted -- surface the failure
            # exponential backoff with full jitter
            time.sleep(random.uniform(0, 0.1 * 2 ** attempt))

Exponential backoff with jitter. That jitter part matters – without it, all your retrying clients slam the recovering service at the exact same intervals. Thundering herd. We learned to cap retry budgets too. Three attempts max, then fall back to cached data or a degraded response. Retrying forever is just a distributed denial-of-service attack against yourself.

Circuit Breakers

After the empty-response incident, we added circuit breakers on every external dependency. If a service fails five times in a row, stop calling it. Try again in 30 seconds with a single probe request. If the probe succeeds, gradually let traffic back through.

This is the pattern that prevented the most cascading failures for us. Simple concept, massive impact.
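A minimal sketch of the idea, assuming failure counting and a cooldown timer (a production version would also limit the probe to a single in-flight request):

```python
import time

class CircuitBreaker:
    """Open after N consecutive failures; allow a probe after a cooldown."""
    def __init__(self, failure_threshold=5, cooldown=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self):
        if self.opened_at is None:
            return True
        # Circuit is open: permit a probe once the cooldown has elapsed.
        return time.monotonic() - self.opened_at >= self.cooldown

    def record_success(self):
        self.failures = 0
        self.opened_at = None  # close the circuit again

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```

Callers check `allow_request()` before each call and report the outcome back; when the breaker is open, they go straight to their fallback instead of hammering a dead dependency.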

Bulkheads

We ran our NLP pipeline and our content delivery API on shared infrastructure early on. The NLP jobs were CPU-hungry beasts. During a big news event – earnings season, a market crash – the pipeline would spike and starve the API of resources.

Solution: isolation. Separate thread pools, separate connection pools, separate queues. One misbehaving component can’t eat another component’s lunch. Same principle as watertight compartments on a ship. The Titanic metaphor is overused but accurate.
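In Python terms, the thread-pool version of this isolation looks roughly like the following (pool sizes and function bodies are illustrative stand-ins):

```python
from concurrent.futures import ThreadPoolExecutor

# Separate pools: a runaway NLP job can exhaust its own workers
# but can never starve the API pool.
nlp_pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix="nlp")
api_pool = ThreadPoolExecutor(max_workers=16, thread_name_prefix="api")

def score_article(text):
    # stand-in for the CPU-hungry NLP pipeline work
    return len(text.split())

def serve_feed(user):
    # stand-in for the latency-sensitive API path
    return f"feed for {user}"

word_count = nlp_pool.submit(score_article, "markets rally on earnings").result()
feed = api_pool.submit(serve_feed, "alice").result()
```

The same split applies to connection pools and queues: each component gets its own bounded resources, so saturation stays local.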

Fallbacks Over Failures

When our relevance scoring service was slow, we didn’t show users an error page. We showed them a feed sorted by recency instead. Less personalized, still useful. When a data source was down, we served cached content with a “last updated” timestamp.

Define what a degraded-but-useful response looks like for every endpoint. Caches, defaults, partial results. Users almost always prefer stale data over no data.
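The recency fallback can be sketched like this, assuming a scoring callable and dict-shaped articles (field names are illustrative):

```python
def build_feed(articles, score_fn, timeout=0.5):
    """Personalized feed if scoring answers in time; recency order otherwise."""
    try:
        scores = score_fn(articles, timeout=timeout)
        return sorted(articles, key=lambda a: scores[a["id"]], reverse=True)
    except (TimeoutError, ConnectionError):
        # Degraded but useful: newest first, no personalization.
        return sorted(articles, key=lambda a: a["published_at"], reverse=True)
```

The point is that the fallback path is decided in code, ahead of time, not improvised during an incident.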

Idempotency

In a system with retries and at-least-once message delivery, duplicate operations are inevitable. Every write operation needs to be safe to execute twice.

POST /orders
Idempotency-Key: 6f9c2c70-4f0e-4f36-8a6a-3b5a9f4e3c2d

We tagged every content ingestion event with a unique key. If the same article arrived twice from different sources or from a retry, the system recognized the duplicate and skipped it. Without this, our feeds would have been full of repeated headlines during any period of instability.
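The receiving side of that dedupe can be sketched as follows; a real system would back the key store with a database or a cache with a TTL rather than an in-memory dict:

```python
processed = {}  # idempotency key -> stored result

def ingest(key, article, handler):
    """Execute handler(article) at most once per idempotency key."""
    if key in processed:
        return processed[key]  # duplicate delivery: return the original outcome
    result = handler(article)
    processed[key] = result
    return result
```

Returning the stored result for a duplicate key means a retried request observes exactly what the first attempt produced.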

Consistency: Pick Your Battles

Strong consistency across a distributed system is expensive and fragile. We tried it early on with a two-phase commit pattern for content updates. It was slow, it was brittle, and it failed in weird ways under load.

We switched to eventual consistency with explicit conflict resolution. Version numbers on every content record. Last-writer-wins for most fields, merge logic for aggregated scores. Sagas with compensating actions for multi-step workflows – if step three fails, step two gets rolled back automatically.
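A toy version of that merge, assuming versioned records and taking the max as the score-merge rule (the field names and merge choice are illustrative, not our exact production logic):

```python
def merge(local, remote):
    """Last-writer-wins on version; aggregated score is merged, not overwritten."""
    winner = remote if remote["version"] > local["version"] else local
    merged = dict(winner)
    # Merge logic for the aggregated field instead of blind overwrite.
    merged["score"] = max(local["score"], remote["score"])
    merged["version"] = max(local["version"], remote["version"])
    return merged
```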

The key insight: design your conflict resolution strategy before you need it. A 2am incident is not the moment to invent reconciliation logic.

Operations Are Half the Battle

The best-designed system falls apart without operational discipline.

Health checks: We exposed both liveness (process is running) and readiness (process can serve traffic and reach its dependencies) endpoints. Kubernetes used these to route around unhealthy instances automatically.
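Stripped of the HTTP framing, the two checks differ like this (a sketch; the dependency flags stand in for real connectivity and content-validity probes):

```python
def liveness():
    # Process is up; nothing more. Failing this means "restart me."
    return 200, "alive"

def readiness(db_ok, cache_ok, content_fresh):
    """Serve traffic only when dependencies are reachable AND the content
    we would serve is actually valid -- the empty-response lesson."""
    if db_ok and cache_ok and content_fresh:
        return 200, "ready"
    return 503, "not ready"
```

Keeping the two separate matters: a readiness failure should pull an instance out of rotation, not restart it.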

Observability: Metrics for rates and latency. Logs for detail. Distributed tracing for following a request across services. After the empty-response incident, we added content-validity checks to our health signals. Green dashboard means nothing if you’re not checking the right things.

Load shedding: When traffic spikes beyond capacity, reject requests explicitly with a 503 rather than letting everything slow to a crawl. Backpressure propagated through the system keeps critical paths alive.

Safe deployments: Canary releases. Feature flags. Fast rollback. We never did big-bang deployments after the first time it went wrong (and it went wrong immediately).

Test for Failure, Not Just Success

Happy-path tests prove your system works when everything is fine. That’s the easy part.

We injected latency into staging environments. Killed instances randomly. Returned garbage from mocked dependencies. Simulated network partitions between services. Every one of these tests caught bugs that would have hit production eventually.
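The dependency-wrapping part of that fault injection can be as small as a decorator like this (a sketch for staging use; the name and parameters are mine):

```python
import random, time

def flaky(fn, p_fail=0.1, max_delay=0.5):
    """Wrap a dependency call to inject random latency and failures."""
    def wrapped(*args, **kwargs):
        time.sleep(random.uniform(0, max_delay))  # injected latency
        if random.random() < p_fail:
            raise ConnectionError("injected fault")  # injected failure
        return fn(*args, **kwargs)
    return wrapped
```

Wiring this around real clients in staging forces every caller to exercise its timeout, retry, and fallback paths long before production does.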

Load testing matters too, but don’t just test peak throughput. Test sustained load. Test spiky traffic. Test what happens when load ramps up while a dependency is already degraded. That combination is what actually happens in production.

The Real Lesson

Every distributed system I’ve worked on has taught me the same thing: failure isn’t an edge case. It’s an input. The question is never “will something fail?” It’s “when this fails, does the system do something reasonable?”

At the fintech startup, the systems that survived best weren’t the ones with the cleverest algorithms or the most redundancy. They were the ones where we’d thought through every dependency and asked: “What do we do when this is gone?” And then actually built the answer.