Quick take
Most business continuity plans are useless binder filler. Real continuity is about people, not servers. Know who can do what when half your team is gone, kill your single points of failure, and actually practice the plan. The 50-page PDF nobody reads won’t save you.
It’s April 2020 and the world is on fire. COVID just stress-tested every company’s business continuity plan, and the results are… not great.
I’ve been through a few versions of this movie. At the fintech startup we had engineers across multiple countries, so the “what if someone disappears” scenario wasn’t hypothetical. At Dropbyke we had hardware in the field and a tiny team. I’ve read more BCP documents than any human should. Most of them are garbage.
The typical BCP is written by a management consultant who has never ssh’d into a production server. It’s 40 pages of org charts and escalation matrices living in a SharePoint folder nobody remembers. When the actual crisis hits, people ignore it and open Slack.
Let me talk about what actually works, from an engineering perspective. Over coffee, not in a boardroom.
Continuity is about people, not systems
Disaster recovery and business continuity are different things. DR is “the database is gone, restore from backup.” BCP is “three of your five engineers are sick, the office is closed, and your biggest client needs a deployment by Friday.”
The second one is harder. Way harder.
Most engineering teams have never seriously asked: if person X gets hit by a bus tomorrow, what breaks? Not the servers. The knowledge. The person who knows why that cron job runs at 3am. The person with the credentials to the legacy payment gateway. The person who understands what happens if you flip that config flag.
I’ve seen teams lose weeks because one person went on vacation and nobody else knew how to deploy to staging. Vacation. Not a pandemic. Vacation.
Kill your single points of failure
You already know about redundant servers and multi-region deployments. I’m talking about the human single points of failure that every team pretends don’t exist.
Audit these right now:
- The deploy gatekeeper. Only one person can push to production? Fix it today.
- The credentials hoarder. Critical passwords live in one person’s head or their personal 1Password vault? You’re one resignation away from a lockout.
- The tribal knowledge holder. That engineer who built the system three years ago and never documented anything? Pair someone with them. This week.
- The single VPN. Your whole remote access goes through one gateway? Add another one and actually test the failover.
None of this is glamorous. Nobody gets promoted for writing runbooks. But it’s the difference between “we handled it” and “we lost a week.”
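A quick way to find the tribal-knowledge holders is to ask git. The sketch below flags files whose entire recent commit history comes from a single author. It's a rough heuristic, not a real bus-factor metric, and it assumes you run it from inside a git checkout:

```python
"""Rough bus-factor audit: flag files whose recent commit history
comes from a single author. A heuristic sketch, not a real metric."""
import subprocess
from collections import defaultdict

SENTINEL = "@@@"  # marks author lines so they can't be confused with paths

def parse_log(log_text):
    """Map each file path to the set of authors who touched it."""
    authors_by_file = defaultdict(set)
    author = None
    for line in log_text.splitlines():
        if line.startswith(SENTINEL):
            author = line[len(SENTINEL):]
        elif line.strip() and author is not None:
            authors_by_file[line.strip()].add(author)
    return authors_by_file

def single_author_files(since="1 year ago"):
    """Files where only one person has committed since `since`."""
    log_text = subprocess.run(
        ["git", "log", f"--since={since}", "--name-only",
         f"--pretty={SENTINEL}%an"],
        capture_output=True, text=True, check=True,
    ).stdout
    return sorted(
        path for path, authors in parse_log(log_text).items()
        if len(authors) == 1
    )

if __name__ == "__main__":
    for path in single_author_files():
        print(path)
```

Every path it prints is a place where one resignation, one illness, or one vacation takes the knowledge with it.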
Documentation as a survival tool
I know, I know. Engineers hate writing docs. I hate writing docs. But documentation isn’t a compliance artifact. It’s a survival tool. You’re writing it for the panicked version of your teammate at 2am who can’t reach you.
Focus on three things:
- Runbooks for the scary stuff. Not “how to use git.” How to restore the database. How to roll back a bad deploy. How to rotate compromised credentials. The stuff you can’t afford to figure out in real time.
- Access maps. Who has access to what, and how does someone else get it if that person is gone. Sounds basic. Most teams can’t answer it completely.
- Decision records. Why did we pick Postgres over Mongo? Why is this service deployed separately? When the person who made the call leaves, the context leaves with them. Write it down.
Cross-training is the other half. At the fintech startup we made it a rule that at least two people could handle any critical system. We rotated on-call and deployments so the knowledge was real, not theoretical.
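The "at least two people" rule is easy to turn into an automated check if you keep the access map as plain data. A minimal sketch; the systems and names below are made up:

```python
"""Minimal access-map check: fail loudly if any critical system has
fewer than two people who can operate it. Systems and names here are
illustrative -- keep the real map wherever your team keeps docs."""

ACCESS_MAP = {
    "production deploy": {"dana", "miguel"},
    "database restore":  {"dana", "priya"},
    "payment gateway":   {"miguel"},          # bus factor of one!
}

def understaffed(access_map, minimum=2):
    """Return systems with fewer than `minimum` trained operators."""
    return sorted(
        system for system, people in access_map.items()
        if len(people) < minimum
    )

print(understaffed(ACCESS_MAP))  # → ['payment gateway']
```

Run it in CI and a shrinking team can't silently drop a critical system back to one person.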
Remote-readiness is continuity
Three weeks ago, companies were debating whether engineers could work from home. Now everyone’s remote and half the VPNs are melting.
If your CI/CD pipeline requires someone to be physically in the office, that’s a continuity failure. If your auth depends on the office network, continuity failure. If your docs live behind a VPN that handles 20 concurrent connections and you have 200 engineers… you get it.
The fix isn’t complicated. Cloud CI/CD. Zero-trust auth. Accessible docs. Most modern teams have this already. But “most modern teams” is a smaller group than the industry likes to admit.
Know your tiers
When capacity drops — and it will — you need to already know what matters. Don’t figure this out during the crisis.
- Keep alive at all costs: production uptime, security incident response, data integrity and backups.
- Important but can flex: bug fixes for paying customers, billing, support tooling.
- Can wait: new features, refactors, that Kubernetes migration you’ve been planning.
Have this conversation with product and leadership before things go sideways. In crisis mode, every stakeholder thinks their thing is critical. Set the tiers in advance. Get sign-off. Write it down.
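Writing the tiers down as data makes crisis-mode triage a lookup instead of a debate. A sketch using the tiers above; the assignments are this article's examples and yours will differ:

```python
"""Pre-agreed priority tiers as data, so reduced-capacity triage is
a lookup, not an argument. Tier assignments are illustrative."""
from enum import IntEnum

class Tier(IntEnum):
    KEEP_ALIVE = 1   # keep alive at all costs
    FLEX = 2         # important but can flex
    CAN_WAIT = 3     # pause when capacity drops

WORK = {
    "production uptime": Tier.KEEP_ALIVE,
    "security incident response": Tier.KEEP_ALIVE,
    "bug fixes for paying customers": Tier.FLEX,
    "billing": Tier.FLEX,
    "new features": Tier.CAN_WAIT,
    "kubernetes migration": Tier.CAN_WAIT,
}

def in_scope(work, max_tier):
    """Everything you still do when capacity only covers `max_tier`."""
    return sorted(item for item, tier in work.items() if tier <= max_tier)

# Half the team is out: only keep-alive work stays in scope.
print(in_scope(WORK, Tier.KEEP_ALIVE))
```

The point isn't the code, it's that the file is signed off before the crisis, so nobody relitigates priorities at 2am.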
Your vendors are your problem
If Stripe goes down, your checkout goes down. That’s your problem, not Stripe’s. Customers don’t care about your vendor’s SLA.
For every critical vendor, ask: Can you degrade gracefully? Can you queue and retry? Is there a fallback, even a manual one?
You don’t need a hot standby for every service. But you need to have thought about it. “We’ll figure it out” isn’t a plan.
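The "queue and retry" option can be as simple as parking failed calls locally and draining them later, instead of failing the whole checkout. A sketch; `charge_via_vendor` is a stand-in for whatever SDK call you actually make:

```python
"""Sketch of queue-and-retry fallback for a flaky vendor: accept the
order, park the charge locally, settle later. `charge_via_vendor` is
a placeholder for the real vendor call."""
from collections import deque

retry_queue = deque()

def charge(order, charge_via_vendor):
    """Try the vendor; on connectivity failure, degrade gracefully."""
    try:
        return charge_via_vendor(order)
    except ConnectionError:
        retry_queue.append(order)  # settle payment later, keep the order
        return "queued"

def drain(charge_via_vendor):
    """Retry queued charges; anything still failing goes back on the queue."""
    for _ in range(len(retry_queue)):
        order = retry_queue.popleft()
        try:
            charge_via_vendor(order)
        except ConnectionError:
            retry_queue.append(order)
```

A real version needs persistence and idempotency keys so a retried charge can't double-bill, but the shape is this simple.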
Practice or it doesn’t count
The plan that’s never been tested is just a wish. I’ve seen beautifully written BCP documents fall apart the first time someone tried to follow them. Steps were wrong. Access had changed. The backup contact had left the company six months ago.
Run drills. Restore from backup and time it. Do a deploy from home on a random Tuesday. Simulate a vendor outage. Even a 30-minute tabletop exercise (“okay, AWS us-east-1 is down, what do we do?”) will expose gaps you didn’t know about.
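"Restore from backup and time it" is worth scripting, because the number you get is the number you compare against the recovery-time objective you told leadership. A sketch; the `pg_restore` invocation is illustrative, swap in whatever restores your database:

```python
"""Time a restore drill against your recovery-time objective (RTO).
The pg_restore command is illustrative -- substitute your own."""
import subprocess
import time

RTO_SECONDS = 30 * 60  # the number you promised leadership

def timed_drill(step):
    """Run one drill step (a zero-arg callable returning True on
    success) and report (succeeded, elapsed_seconds)."""
    start = time.monotonic()
    ok = step()
    return ok, time.monotonic() - start

def restore_step():
    # Illustrative: restore last night's dump into a scratch database.
    result = subprocess.run(
        ["pg_restore", "--clean", "-d", "drill_db", "backup.dump"]
    )
    return result.returncode == 0

if __name__ == "__main__":
    ok, elapsed = timed_drill(restore_step)
    print(f"restore {'ok' if ok else 'FAILED'} in {elapsed:.0f}s")
    if not ok or elapsed > RTO_SECONDS:
        print("drill failed or blew the RTO -- file action items now")
```

Run it quarterly. The first run almost always fails, which is exactly the point.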
When you find gaps — and you will — fix them with specific action items, owners, and deadlines. Not “we should improve this.” That means nothing.
The real talk
Business continuity isn’t a document. It’s a muscle. You build it by doing the boring work: writing runbooks, rotating knowledge, testing backups, having the uncomfortable conversation about what happens when key people are unavailable.
The companies handling this pandemic well aren’t the ones with the best BCP binders. They’re the ones who already operated like things could break at any time. Because things always break. The timing is the only surprise.