Quick take
Your failover probably doesn’t work. Test it now, before 3 AM teaches you that lesson instead.
The Night Our Failover Lied to Us
June, peak season at Dropbyke. We had just crossed a threshold where our fleet was large enough that evening ride demand was genuinely stressing our backend. The database was a PostgreSQL primary with a streaming replica. Standard setup. We had tested promotion of the replica exactly once, four months earlier, during a quiet Tuesday afternoon. It worked fine then.
At 2:47 AM on a Thursday, the primary ran out of disk. My fault. I had bumped up WAL retention for debugging a replication lag issue two weeks prior and forgot to revert it. The primary filled its volume, panicked, and went read-only.
No problem, I thought. We have a replica. I’ll promote it and we’re back.
The replica was 11 hours behind.
Eleven hours. The replication lag had been silently growing for days. Our monitoring checked that the replica process was running. It didn’t check how far behind it actually was. The process was alive and healthy. The data wasn’t.
So now I’m sitting in my kitchen at 3 AM with a read-only primary that has current data and a replica that thinks it’s yesterday afternoon. If I promote the replica, I lose every ride, every payment, every account change from the last 11 hours. If I don’t promote it, the entire system stays read-only and nobody can start a ride.
I ended up provisioning a new volume, copying the data directory from the primary, and mounting it with more space. Took about 40 minutes. Forty minutes of complete service outage during which bikes were locked all over the city and our support inbox was filling up.
The postmortem was humbling. We had a failover strategy that we believed worked because we had tested it once under ideal conditions. We had monitoring that checked the wrong thing. And the root cause was a config change I made and forgot about. No exotic bug. No sophisticated attack. Just a forgotten setting, a lazy health check, and an untested assumption.
That night changed how I think about resilience.
Failures Are Ordinary. Cascades Aren’t.
Every system I’ve operated has failed. Networks drop packets. Services crash when they leak memory. Hardware dies. Certificates expire. Third-party APIs go down at the worst possible moment.
None of that’s surprising. The question is never whether something will fail. It’s whether one failure drags everything else down with it.
After the Dropbyke incident, I started categorizing failures differently. I stopped caring about the probability of individual failures and started obsessing over blast radius. A database going read-only is a problem. A database going read-only that also kills authentication, ride tracking, and payment processing is a catastrophe.
Degradation Is a Feature
When a dependency fails, the system needs a plan that isn’t “wait and hope.” At Dropbyke, after the outage, we built explicit degradation modes. If the database went read-only, users could still end active rides using cached state. They couldn’t start new ones, but at least nobody was stranded.
This isn’t a good experience. It’s a usable one. That distinction matters more than most engineers think. Users tolerate “the app is slow right now” much better than “the app is completely dead.” Give them something, anything, while you fix the real problem.
A recommendation engine falls back to popular items. A payment system queues orders for later. A search feature lets people browse categories. None of these are ideal. All of them buy you time.
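The common shape of those fallbacks is a wrapper that tries the real path and serves the degraded one on failure. A minimal sketch; the function names are hypothetical, not Dropbyke’s actual code:

```python
def with_fallback(primary, fallback):
    """Try the primary path; on any failure, serve the degraded result instead."""
    def wrapped(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            return fallback(*args, **kwargs)
    return wrapped

# Hypothetical example: personalized recommendations degrade to popular items.
def personalized_recs(user_id):
    raise ConnectionError("recommendation service is down")

def popular_items(user_id):
    return ["top-1", "top-2", "top-3"]

get_recs = with_fallback(personalized_recs, popular_items)
```

The important design choice is that the fallback is decided in advance, in code, rather than improvised during the incident.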
Isolation: Keep the Fire in One Room
The reason our Dropbyke outage was total instead of partial is that everything depended on everything. One database, one connection pool, one failure domain. A spike in ride-end processing could starve the authentication path. A slow query in analytics could block real-time tracking.
After the incident, we separated concerns aggressively.
Bulkheads. Dedicated connection pools for critical paths versus background work. If the analytics queries get slow, they burn their own pool, not the one serving live rides.
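A bulkhead can be as simple as a bounded semaphore per workload; a sketch under that assumption, with illustrative pool sizes:

```python
import threading

class Bulkhead:
    """Cap concurrent work per pool so one workload can't starve another."""
    def __init__(self, max_concurrent):
        self._sem = threading.BoundedSemaphore(max_concurrent)

    def run(self, func, *args, **kwargs):
        # Fail fast when the pool is exhausted instead of queueing forever.
        if not self._sem.acquire(blocking=False):
            raise RuntimeError("bulkhead full: shed load, don't queue")
        try:
            return func(*args, **kwargs)
        finally:
            self._sem.release()

# Separate failure domains: live rides never compete with analytics for slots.
rides_pool = Bulkhead(max_concurrent=20)
analytics_pool = Bulkhead(max_concurrent=5)
```

Rejecting work at a full bulkhead is deliberate: queueing would just move the backlog somewhere less visible.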
Circuit breakers. If a dependency is unhealthy, hammering it makes things worse. We wrapped external calls so that after enough failures, the system stops trying and returns a fallback immediately. Simple implementation:
    import time

    class CircuitOpenError(Exception):
        pass

    class CircuitBreaker:
        def __init__(self, failure_threshold=5, recovery_timeout=30.0):
            self.failure_threshold, self.recovery_timeout = failure_threshold, recovery_timeout
            self.failure_count, self.state, self.opened_at = 0, "closed", 0.0
        def _should_attempt_recovery(self):
            # After the cooldown, let one trial call through.
            return time.monotonic() - self.opened_at >= self.recovery_timeout
        def _on_success(self):
            self.failure_count, self.state = 0, "closed"
        def _on_failure(self):
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.state, self.opened_at = "open", time.monotonic()
        def call(self, func, *args, **kwargs):
            if self.state == "open" and not self._should_attempt_recovery():
                raise CircuitOpenError()
            try:
                result = func(*args, **kwargs)
                self._on_success()
                return result
            except Exception:
                self._on_failure()
                raise
Nothing clever. That’s the point. Resilience mechanisms should be boring and predictable.
Timeouts on everything. Every external call gets a timeout. Without one, a slow dependency consumes your capacity until you’re the slow dependency for someone else. Start generous, tighten based on real latency data.
Rate limiting. Overload is a failure mode. Treat it like one. When you hit a limit, respond clearly and fast so callers know when to retry.
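The standard mechanism here is a token bucket: a sustained rate plus a bounded burst, with excess rejected immediately. A minimal sketch; the rates are illustrative:

```python
import time

class TokenBucket:
    """Allow a sustained rate with limited bursts; reject the excess right away."""
    def __init__(self, rate_per_s, burst):
        self.rate, self.capacity = rate_per_s, burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self):
        # Refill proportionally to elapsed time, capped at the burst size.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller should get an explicit "retry later" response
```

On rejection, answer with something the caller can act on (e.g. an HTTP 429 with a retry hint) rather than a slow, ambiguous failure.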
Redundancy You Actually Test
We had a replica. We thought we had redundancy. We didn’t. We had a replica that made us feel safe without actually being safe.
After that night, the rule became: if you haven’t failed over to it this month, it doesn’t count as redundancy. Stateless services are easy. Any instance handles any request. Stateful systems are harder. Replication is the tool, but the tradeoffs matter. Synchronous replication costs latency and gives you stronger durability. Asynchronous replication is faster but you can lose recent writes. Pick one and understand what you’re accepting.
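For PostgreSQL specifically, that choice comes down to a couple of primary-side settings; a hedged fragment (the settings are real, the standby name is a placeholder):

```
# postgresql.conf on the primary
synchronous_commit = on
synchronous_standby_names = 'replica1'   # empty string means asynchronous replication
```

With the list empty, commits acknowledge immediately and recent writes can be lost on failover; with a standby listed, commits wait for it, trading write latency for durability.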
The important part is exercising the failover path regularly. Not reading a runbook. Actually doing it. In production, during business hours, with someone watching the metrics.
Observability That Checks the Right Thing
Our monitoring checked “is the replica process running?” It should have checked “how many bytes behind is the replica?” Those are very different questions.
After the outage, I rewrote our monitoring with a simple principle: monitor what the user experiences, not what the machine reports. Error rates. Latency percentiles. Successful ride starts per minute. If those numbers move, something is wrong, and you don’t need to know which machine is unhappy to start responding.
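As a concrete example of checking the right thing, the byte-lag question maps to a single query against PostgreSQL’s `pg_stat_replication` view; the threshold and alert messages below are illustrative assumptions, not our exact check:

```python
# Real PostgreSQL (10+) system functions and view; run this on the primary.
LAG_QUERY = """
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication;
"""

def check_replica_lag(lag_bytes, max_lag_bytes=64 * 1024 * 1024):
    """Alert on what a failover would lose, not on whether a process is running."""
    if lag_bytes is None:
        return "page: no replica connected at all"
    if lag_bytes > max_lag_bytes:
        return f"page: replica is {lag_bytes} bytes behind"
    return "ok"
```

Note the `None` branch: a replica that isn’t connected at all must page too, otherwise you’re back to monitoring the wrong thing.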
Structured logging with correlation IDs so you can trace a request across services. Distributed tracing to see where time goes. These aren’t optional for anything beyond a single-process application.
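Correlation IDs need very little machinery; a sketch of one-JSON-object-per-line logging, with illustrative field names:

```python
import json
import uuid

def make_log_line(correlation_id, service, event, **fields):
    """One JSON object per line; the correlation_id ties a request together across services."""
    record = {"correlation_id": correlation_id, "service": service, "event": event}
    record.update(fields)
    return json.dumps(record, sort_keys=True)

# The request gets an ID once, at the edge; every downstream service logs it verbatim.
request_id = str(uuid.uuid4())
```

With that in place, reconstructing a cross-service request is a single grep on the ID instead of timestamp archaeology.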
But the biggest lesson was this: your monitoring must not share fate with the thing it monitors. If your alerting depends on the same database that just died, you won’t get the alert. We learned that one the hard way too, but that’s a different story.
Recovery as a Practiced Skill
Automated recovery should be the default for common failures. Health checks, orchestrator restarts, connection retry logic. These fix most issues without waking anyone up.
Rollback needs to be fast and safe. Keep previous builds ready. Make database migrations backward-compatible so you can roll back the application without rolling back the schema. Practice the rollback path before you need it at 3 AM.
After major incidents, bring services back gradually. Watch the metrics. Pause if anything looks wrong. The urge to flip everything back on at once is strong. Resist it.
Discipline Over Heroics
The 3 AM kitchen table fix worked. I got the system back up. People called it heroic. It wasn’t. It was a failure of discipline that required an emergency response.
Heroic fixes feel good in the moment. Blameless postmortems that ship actual fixes feel better six months later when the same scenario hits and the system handles it without waking anyone up.
On-call should be sustainable. Clear escalation paths. Well-maintained runbooks. If your on-call rotation burns people out, your system isn’t resilient; you’re just subsidizing its fragility with human suffering.
Build for the 3 AM You Haven’t Met Yet
Build for failure. Contain blast radius. Rehearse recovery. Monitor what matters. When these become routine, most production failures turn into minor disruptions instead of long outages.
The replica lag incident at Dropbyke was painful. But it taught me something I still carry: the system you think you have and the system you actually have are different things. The only way to close that gap is to test your assumptions regularly and honestly. Not once on a quiet Tuesday. Every month, under real load, with real stakes.