At Decloud, we once ran a migration that added a NOT NULL column with a default value to a table with about 40 million rows. PostgreSQL 9.x. The migration locked the table for a full rewrite. Reads blocked. Writes queued. The API started timing out. Customers noticed within seconds. We rolled it back, but the rollback itself took minutes because the lock queue was backed up.
The total outage was maybe eight minutes. Felt like an hour. And the fix was something we should have known: add the column nullable, backfill in batches, then add the constraint. Three deploys instead of one. No lock, no drama.
That experience permanently changed how I think about database migrations. The schema change itself is rarely the problem. The problem is live traffic hitting the change while it runs.
The expand-and-contract pattern
Every breaking schema change should follow expand-and-contract. It isn’t glamorous. It’s slow. It works.
Expand: Add the new structure alongside the old one. Deploy code that writes to both.
Migrate: Backfill data from old to new. Switch reads to the new path.
Contract: Once the old path is fully retired and you have confirmed stability, drop the old columns or tables.
A simple column rename looks like this:
-- Step 1: Expand
ALTER TABLE users ADD COLUMN full_name text;
-- Step 2: Backfill (in batches, see below)
UPDATE users SET full_name = name
WHERE id BETWEEN 1 AND 10000 AND full_name IS NULL;
-- Step 3: Contract (after all code reads full_name)
ALTER TABLE users DROP COLUMN name;
Three deploys. Three PRs. Each one is small, reversible, and independently verifiable. This is slower than a single ALTER TABLE ... RENAME COLUMN, but it doesn’t lock the table and it doesn’t require a maintenance window.
Application code has to play along
The migration only works if the application tolerates both schemas during the transition. This means:
- Write to both columns during the expand phase. Old code writes to name. New code writes to both name and full_name.
- Read from new, fall back to old. Use COALESCE(full_name, name) in your queries, or handle it in application code.
- Feature flags help. Switch reads to the new column behind a flag. Flip it when the backfill is done.
- Keep old queries working until you ship the contract phase. If something goes wrong after the backfill, you want the ability to revert without data loss.
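The dual-write and fallback-read pattern from the list above can be sketched in a few lines. This is a hypothetical illustration using Python's sqlite3 as a stand-in for the real database; the table and column names follow the rename example earlier, and the function names are invented for the sketch.

```python
import sqlite3

# In-memory stand-in for production. The expand phase already ran:
# full_name exists, but old rows have not been backfilled yet.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, full_name TEXT)")
db.execute("INSERT INTO users (name) VALUES ('Ada')")  # old code path: writes name only

def create_user(name):
    # New code path: write to BOTH columns during the transition.
    db.execute("INSERT INTO users (name, full_name) VALUES (?, ?)", (name, name))

def get_display_name(user_id):
    # Read from the new column, fall back to the old one.
    row = db.execute(
        "SELECT COALESCE(full_name, name) FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    return row[0] if row else None

create_user("Grace")
print(get_display_name(1))  # old row, not yet backfilled: falls back to 'Ada'
print(get_display_name(2))  # new row: reads 'Grace' from full_name
```

Because reads go through COALESCE, flipping traffic to the new column never depends on the backfill being finished.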
The key discipline: never deploy a schema change and a code change that depends on it in the same release. Separate them. Deploy the schema first, then the code. If the schema deploy breaks something, you can roll back without touching the application.
Backfills that don’t melt production
A naive UPDATE users SET full_name = name on a table with tens of millions of rows will generate massive WAL, spike replication lag, and potentially lock out autovacuum. I’ve watched backfills bring down read replicas.
Batch it:
-- Process 5000 rows at a time
UPDATE users SET full_name = name
WHERE id BETWEEN :start AND :end
AND full_name IS NULL;
Run this in a loop with a small delay between batches. Keep transactions short. Commit after each batch. Record progress so the job can resume if it crashes.
Practical tips from doing this too many times:
- Rate limit. 5,000 to 50,000 rows per batch depending on table width and load.
- Short transactions. Long transactions hold locks and block vacuum.
- Idempotent. The WHERE full_name IS NULL clause means you can safely restart.
- Off-peak when possible. Not always an option, but backfilling during low traffic reduces risk.
- Monitor replication lag. If replicas start falling behind, slow down or pause.
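The loop described above might look like this. A minimal sketch using sqlite3 as a stand-in; the batch size and delay are illustrative knobs, not recommendations, and in production the sleep is where you would check replication lag and back off.

```python
import sqlite3
import time

# Stand-in table with 25,000 un-backfilled rows.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, full_name TEXT)")
db.executemany("INSERT INTO users (name) VALUES (?)",
               [(f"user{i}",) for i in range(25_000)])
db.commit()

BATCH = 5_000  # tune per table width and load
DELAY = 0.0    # seconds between batches; raise it if replicas fall behind

def backfill(db):
    start = 1
    max_id = db.execute("SELECT MAX(id) FROM users").fetchone()[0]
    while start <= max_id:
        end = start + BATCH - 1
        # Idempotent: the IS NULL filter means a crashed run can just restart.
        db.execute(
            "UPDATE users SET full_name = name "
            "WHERE id BETWEEN ? AND ? AND full_name IS NULL",
            (start, end),
        )
        db.commit()        # short transactions: commit after every batch
        time.sleep(DELAY)  # rate limit; pause here if replication lag spikes
        start = end + 1

backfill(db)
remaining = db.execute(
    "SELECT COUNT(*) FROM users WHERE full_name IS NULL"
).fetchone()[0]
print(remaining)  # 0 once the backfill is complete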
The operations PostgreSQL will punish you for
Some DDL operations are deceptively dangerous. They look like simple one-liners but can lock tables, rewrite data, or kill your replication.
Creating an index without CONCURRENTLY: A regular CREATE INDEX takes a SHARE lock on the table, blocking all writes until it finishes. On a large table, that can be minutes.
SET lock_timeout = '5s';
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);
Always use CONCURRENTLY. Always set a lock timeout so it fails fast instead of waiting indefinitely. One caveat: if CREATE INDEX CONCURRENTLY fails or is cancelled, it leaves behind an INVALID index that you must drop before retrying.
Adding NOT NULL with a default (pre-PG 11): Before PostgreSQL 11, ALTER TABLE ADD COLUMN ... DEFAULT ... NOT NULL rewrote the entire table. PG 11+ handles this without a rewrite, but if you’re on an older version, add the column nullable first, backfill, then add the constraint.
Large data copies and foreign key additions: Adding a foreign key validates every existing row. On a big table, that’s a scan with a lock. Use NOT VALID to add the constraint without validation, then VALIDATE CONSTRAINT separately:
ALTER TABLE orders ADD CONSTRAINT fk_customer
FOREIGN KEY (customer_id) REFERENCES customers(id)
NOT VALID;
-- Later, without blocking writes:
ALTER TABLE orders VALIDATE CONSTRAINT fk_customer;
The operational workflow
Before you touch production:
- Document every step and its expected impact. Not a novel. A numbered list.
- Rehearse on production-size data. Staging with 1,000 rows won’t catch lock issues on a 50 million row table.
- Verify your backups work. Not “we have backups.” Actually restore and check.
- Decide your abort criteria in advance. “If lock waits exceed 5 seconds or replication lag exceeds 30 seconds, we stop.”
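Abort criteria work best when they live in the migration script itself, not just the runbook. A hypothetical guard using the thresholds from the example above (the function name and return convention are invented for the sketch; wiring it to real lock-wait and lag metrics is left out):

```python
LOCK_WAIT_LIMIT_S = 5.0         # abort if lock waits exceed 5 seconds
REPLICATION_LAG_LIMIT_S = 30.0  # abort if replication lag exceeds 30 seconds

def should_abort(lock_wait_s, replication_lag_s):
    """Return a reason string if any abort threshold is breached, else None."""
    if lock_wait_s > LOCK_WAIT_LIMIT_S:
        return f"lock wait {lock_wait_s:.1f}s exceeds {LOCK_WAIT_LIMIT_S}s"
    if replication_lag_s > REPLICATION_LAG_LIMIT_S:
        return f"replication lag {replication_lag_s:.1f}s exceeds {REPLICATION_LAG_LIMIT_S}s"
    return None

print(should_abort(1.2, 4.0))  # None: keep going
print(should_abort(6.5, 4.0))  # lock wait breached: stop the migration
```

Checking this between every batch turns “we stop” from a judgment call under pressure into an automatic decision.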
During the migration:
- Watch lock waits, query latency, and replication lag in real time.
- Validate each step before moving to the next one.
- Have the rollback commands ready in a terminal. Not in a wiki. In a terminal.
After:
- Verify data integrity. Row counts, null rates, spot checks.
- Keep the old columns around for a few days. You want a rollback window.
- Clean up temporary code paths and old columns only after you’re confident.
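The integrity checks above can be scripted rather than eyeballed. A sketch using sqlite3 as a stand-in, checking row counts, null rates, and old/new column agreement for the rename example (the verify function and report shape are illustrative):

```python
import sqlite3

# Stand-in table in its post-backfill state.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, full_name TEXT)")
db.executemany(
    "INSERT INTO users (name, full_name) VALUES (?, ?)",
    [(f"user{i}", f"user{i}") for i in range(1000)],
)

def verify(db):
    total = db.execute("SELECT COUNT(*) FROM users").fetchone()[0]
    # Null rate in the new column: should be zero after a complete backfill.
    nulls = db.execute(
        "SELECT COUNT(*) FROM users WHERE full_name IS NULL"
    ).fetchone()[0]
    # Spot check: old and new columns must agree wherever both are set.
    mismatches = db.execute(
        "SELECT COUNT(*) FROM users WHERE full_name IS NOT NULL AND full_name <> name"
    ).fetchone()[0]
    return {"total": total, "nulls": nulls, "mismatches": mismatches}

report = verify(db)
assert report["nulls"] == 0 and report["mismatches"] == 0, report
```

Run it before flipping reads, and again before the contract phase; a nonzero mismatch count is exactly the kind of thing a spot check misses.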
Rollback isn’t always possible
Additive changes are easy to roll back – just drop the new column. Destructive changes aren’t. If you dropped a column and need it back, you need a restore from backup.
This is why the contract phase should be the last step, done only after the system has been stable for days or weeks. Keep old columns around longer than feels necessary. Storage is cheap. Outages aren’t.
The uncomfortable truth about zero-downtime migrations: they aren’t about clever SQL or sophisticated tooling. They’re about discipline. Small steps. Verified assumptions. Rollback plans written before you need them. The teams that do this well aren’t smarter. They’re more disciplined. And they get to keep their weekends.