Quick take
Stop looking for a deploy tool that promises zero downtime. It doesn’t exist. Zero downtime is a discipline across code, database migrations, and infrastructure – and the hardest part is the schema changes, not the YAML.
I’ve shipped code that took down production exactly twice in my career. Once at the fintech startup, when a database migration locked a table for eleven minutes during London market open. Once during my EF batch, when a colleague and I pushed a config change that broke backward compatibility with in-flight requests. Both times, the deploy tooling worked perfectly. The problem was us.
That’s the thing nobody tells you about zero downtime. It’s not about the rollout strategy. It’s about every decision surrounding the rollout.
What “Zero Downtime” Actually Means
Two things. That’s it.
- Capacity never drops below what traffic needs.
- Running code stays compatible with in-flight requests and existing data.
If both hold, your deploy is invisible to users. If either breaks, you have an outage regardless of how fancy your canary setup is.
The Real Killer: Database Migrations
Rolling updates, blue-green, canary – pick your favorite. They all handle the application layer fine. The hard part is always the database.
A schema change that locks a table or drops a column still in use will take you down no matter what deployment pattern you’re running. I learned this the painful way with Postgres at the fintech startup, where our users table had enough rows that a naive ALTER TABLE would hold a lock for minutes.
The fix is boring. Expand, migrate, contract. In that order, across separate deploys.
-- Deploy 1: Expand (add the new column)
ALTER TABLE users ADD COLUMN phone VARCHAR(20);
-- Deploy 2: Migrate (backfill in batches, not one giant UPDATE;
-- Postgres UPDATE has no LIMIT clause, so batch through a keyed subquery)
UPDATE users SET phone = legacy_phone
WHERE id IN (SELECT id FROM users WHERE phone IS NULL LIMIT 1000);
-- repeat until zero rows are updated
-- Deploy 3: Contract (drop old column after all code uses the new one)
ALTER TABLE users DROP COLUMN legacy_phone;
Three deploys for one column rename. Annoying? Yes. But nobody notices. That’s the point.
For large tables, tools like gh-ost or pt-online-schema-change do background copies to avoid locks entirely. Worth learning if you’re running anything with real traffic.
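Before you even reach for those tools, Postgres lets you bound how long a DDL statement will wait for its lock. Without this, an ALTER queued behind one long-running transaction blocks every query behind it. A minimal guard (the 2-second value is illustrative):

```sql
-- Fail fast instead of queueing behind a long transaction and blocking
-- every other query on the table:
SET lock_timeout = '2s';
ALTER TABLE users ADD COLUMN phone VARCHAR(20);
-- If this times out, nothing was blocked for more than 2 seconds.
-- Retry later instead of taking the site down.
```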
Pick a Rollout Pattern and Keep It Simple
Rolling Updates
The default and usually the right choice. Set maxUnavailable: 0 so you never lose capacity during the rollout.
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 25%
This works when old and new versions can run side by side. Which means your APIs need to be backward compatible. If that sounds like extra work, it is. But it’s the kind of extra work that prevents 3 a.m. pages.
Canary
Send a small slice of traffic to the new version. Watch your error rate. If it’s clean, ramp up. If it spikes, roll back automatically.
The key word there is “automatically.” A canary release without automated rollback is just a smaller blast radius with manual intervention. You might as well flip a coin.
Blue-Green
Run two environments, switch traffic when the new one is verified. Fast rollback, but you’re paying for double infrastructure. Fine for critical services. Overkill for most things.
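On Kubernetes, one common way to do the switch is a Service selector flip, which makes both cutover and rollback a one-line change. Names here are hypothetical:

```yaml
# Service pointing at the "blue" environment; flip the color label to
# cut over, and flip it back to roll back.
apiVersion: v1
kind: Service
metadata:
  name: checkout
spec:
  selector:
    app: checkout
    color: blue   # change to "green" once the new environment is verified
  ports:
    - port: 80
      targetPort: 8080
```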
The Stuff Between the Deploys
The patterns above are table stakes. What actually makes deploys boring (in a good way) is the plumbing around them.
Readiness vs. liveness probes. These aren’t the same thing. Liveness says “is this process alive.” Readiness says “can this process handle traffic right now.” If your readiness check passes before caches are warm and dependencies are verified, you’re sending users to a half-alive instance.
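The distinction in Kubernetes terms looks like this; paths, ports, and timings are illustrative, and the point is that the two probes hit different endpoints doing different work:

```yaml
# /healthz: "is the process alive" - cheap, no dependency checks.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
# /readyz: "can I serve traffic right now" - verifies caches are warm
# and dependencies are reachable before traffic arrives.
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
```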
Graceful shutdown. When Kubernetes sends SIGTERM, your app needs to stop accepting new connections, finish in-flight requests, then exit. Sounds obvious. Roughly half the Go services I’ve reviewed get this wrong, usually by not waiting long enough for the load balancer to drain.
Connection draining. Set a realistic termination grace period. If your longest request takes 30 seconds, a 5-second grace period is going to drop connections.
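In the pod spec, that number is one field; the 45 here assumes a 30-second worst-case request plus drain headroom:

```yaml
spec:
  terminationGracePeriodSeconds: 45  # slowest request (30s) + LB drain time
```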
DNS TTLs. Lower them before a cutover, raise them after. I’ve seen teams spend hours debugging a “failed deploy” that was actually stale DNS.
When to Actually Worry
Here’s a quick mental checklist I run before every deploy:
- Can the new code read data written by the old code? Can the old code read data written by the new code?
- Does the database migration need a lock? How long?
- Do the readiness checks actually verify dependencies, or do they just return 200?
- If this deploy goes wrong, can I roll back in under a minute?
If any answer is “I’m not sure,” that’s where the work is. Not in the deploy tool.
Why this stays hard
Zero downtime deploys aren’t hard technically. They’re hard culturally. They require everyone on the team to think about backward compatibility, schema evolution, and graceful degradation before they write the code. Not after.
The teams I’ve seen do this well – at the fintech startup, at startups in my EF cohort, at companies I’ve advised – all share one trait. Deploys are boring. Nobody watches them. Nobody holds their breath. The monitoring catches problems and the rollback is automatic.
That’s the goal. Not zero downtime as a feature on a slide deck. Zero downtime as a habit so ingrained that nobody even talks about it anymore.