Design for Failure or It Will Design Your Weekend


Failure is not an edge case. It is the default state you temporarily hold off with good engineering. A few hard-won rules for building systems that bend instead of shatter.

I’m halfway through my EF batch in Singapore, building Decloud, and I keep having the same conversation with other founders here: “We’ll handle reliability later.” Later. The word that has personally cost me more sleep than any production bug.

At the fintech startup, I watched a single slow Elasticsearch query cascade through our entire API layer. One degraded dependency. Total platform outage. The fix took ten minutes. The recovery took four hours. All because nothing in the request path had a timeout.

At Dropbyke, a forgotten WAL retention setting filled a disk at 3 AM and I discovered our “tested” failover was eleven hours behind. Forty minutes of locked bikes across Seoul. The monitoring said everything was fine. The monitoring was wrong.

These weren’t exotic failures. They were boring, preventable ones. The kind that happen when you assume dependencies work and never verify what happens when they don’t.

Three rules I actually follow

Set a deadline on everything. Every outbound call gets a timeout. Every request gets a budget. If a dependency can’t answer in time, you move on without it. Slow failure is worse than fast failure because it holds resources hostage while it dies.
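A request budget can be made concrete with a small helper that every outbound call draws its timeout from. This is a minimal sketch, not a library; `Deadline`, `DeadlineExceeded`, and `call_dependency` are all hypothetical names for illustration:

```python
import time

class DeadlineExceeded(Exception):
    pass

class Deadline:
    """A request-wide time budget, passed down through every call."""

    def __init__(self, budget_seconds: float):
        self.expires_at = time.monotonic() + budget_seconds

    def remaining(self) -> float:
        """Seconds left in the budget; raises if it is already spent."""
        left = self.expires_at - time.monotonic()
        if left <= 0:
            raise DeadlineExceeded("request budget exhausted")
        return left

def call_dependency(deadline: Deadline) -> str:
    # Hypothetical outbound call: whatever client you actually use,
    # pass the *remaining* budget as its timeout, so a slow upstream
    # can never consume more time than the request has left.
    timeout = deadline.remaining()
    return f"called with timeout={timeout:.2f}s"
```

The point of the shrinking budget is that three sequential calls under one 500 ms deadline share 500 ms total, not 500 ms each.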

Isolate the blast radius. A slow search index should never starve your payment flow. Separate connection pools. Separate queues. The goal is simple: one problem stays one problem.
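One way to enforce that isolation is a bulkhead: a per-dependency cap on concurrent calls, so a slow dependency saturates its own slots instead of the shared worker pool. A sketch, assuming a threaded server; the pool names are made up:

```python
import threading

class Bulkhead:
    """Caps concurrent calls to one dependency so it can't drain
    the shared worker pool when it slows down."""

    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, fn, *args):
        # Fail fast instead of queueing: if this dependency's slots
        # are all taken, it's already in trouble, and waiting in line
        # just spreads the trouble to callers.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError(f"{self.name} bulkhead full")
        try:
            return fn(*args)
        finally:
            self._slots.release()

# Hypothetical pools: search gets its own small bulkhead so a slow
# index can never starve the payments path.
search_pool = Bulkhead("search", max_concurrent=2)
payments_pool = Bulkhead("payments", max_concurrent=10)
```

Rejecting at the bulkhead is deliberate: a fast "search is busy" error degrades one feature, while queueing behind a stalled index degrades everything.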

Know your fallback before you need it. A stale cache hit is better than a 500. A default list of popular items is better than a blank page. But the fallback has to be intentional. Accidental fallbacks are just bugs you haven’t noticed yet.

The pattern that keeps saving me

Circuit breakers. Dead simple concept. If a dependency is failing, stop calling it. Serve the fallback. Check back later. It turns a cascading outage into a graceful degradation that most users never notice.

The key insight: a breaker that’s open isn’t a failure state. It’s a success state. It means the system chose fast, predictable behavior over slow, unpredictable death.

What I got wrong early on

I used to think resilience meant more redundancy. Add a replica. Add a region. Add a retry. But redundancy without testing is just a more expensive single point of failure. That Dropbyke replica was a perfect example. It existed. It was running. It was useless.

Now I test the recovery path, not just the happy path. If you haven’t promoted your replica under realistic conditions in the last quarter, you don’t have a failover. You have a hope.

The uncomfortable truth

Designing for failure isn’t a technical problem. It’s a prioritization problem. Every founder and every CTO knows they should do it. Most don’t because the next feature feels more urgent. It always feels more urgent.

Until 3 AM on a Thursday, when it doesn’t.