Database Migrations Without Downtime

A practical guide to evolving schemas without maintenance windows by keeping old and new code compatible at every step.

Quick take

If your migration strategy is “take the site down, run ALTER TABLE, pray, bring it back up,” you’re doing it wrong. Every schema change can be decomposed into steps that keep old and new code running simultaneously. It takes more discipline but zero heroics.

The problem with “just run the migration”

At the fintech startup where I work, we serve financial data. Market data doesn’t pause because your engineering team needs a maintenance window. A feed that goes dark during trading hours is a feed that loses subscribers. When I took over database operations, the existing pattern was to schedule migrations at 2 AM on Sundays, take the API offline, run the DDL, and hope the application came back cleanly. It worked until it didn’t.

The breaking point was a column type change on a table with 40 million rows of historical price data. The migration ran for 47 minutes. During that time the table was locked, the API returned errors, and a downstream consumer silently switched to a stale cache it never recovered from. We spent the next two days cleaning up data consistency issues.

After that I decided every migration would be zero-downtime or it wouldn’t ship.

The expand-and-contract pattern

The core idea is simple. Instead of making a breaking change in one step, you split it into phases where the schema is always compatible with whatever application code is currently running.

Expand: add the new structure alongside the old one. Both coexist.

Migrate: deploy code that writes to both old and new, reads from new. Backfill historical data.

Contract: once every running instance uses the new structure, remove the old one.

This pattern handles nearly every schema change. The details vary, but the rhythm stays the same.
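Before the PostgreSQL specifics, the rhythm can be sketched end to end against a toy in-memory table (Python; the dict-of-dicts rows and helper names are illustrative stand-ins, not a real driver):

```python
# Expand-and-contract on a toy in-memory "table": rows are dicts keyed by id.
# At every intermediate state, old code (reads `price`) and new code
# (reads `unit_price` with a fallback) both keep working.

rows = {1: {"price": 10.0}, 2: {"price": 20.5}}

# Expand: add the new column alongside the old one, nullable.
for row in rows.values():
    row["unit_price"] = None

# Migrate: new code dual-writes; a backfill copies historical rows.
def write_price(row, value):
    row["price"] = value       # old column, still written
    row["unit_price"] = value  # new column

def read_price(row):
    v = row.get("unit_price")
    return v if v is not None else row["price"]

write_price(rows[1], 11.0)     # a live write through new code
for row in rows.values():      # backfill everything else
    if row["unit_price"] is None:
        row["unit_price"] = row["price"]

# Contract: every reader now uses unit_price, so the old column can go.
for row in rows.values():
    del row["price"]
```

The three states map directly onto real deployments: the expand migration, the dual-write release plus backfill, and the contract migration.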

Adding a column safely

The simplest case. Adding a nullable column with no default doesn’t rewrite the table in PostgreSQL: it’s a metadata-only change that holds its lock for milliseconds. (Since PostgreSQL 11, even a column with a constant default avoids the rewrite.)

ALTER TABLE trades ADD COLUMN settlement_date DATE;

Fast. Safe. Old code ignores the column. New code starts writing to it.

If you need a NOT NULL constraint, don’t add it in the same statement. A direct ALTER COLUMN ... SET NOT NULL scans the entire table to verify the constraint, holding an ACCESS EXCLUSIVE lock the whole time. Instead:

-- Step 1: add nullable column
ALTER TABLE trades ADD COLUMN settlement_date DATE;

-- Step 2: backfill in batches (see below)

-- Step 3: add constraint without full table scan
ALTER TABLE trades ADD CONSTRAINT trades_settlement_not_null
  CHECK (settlement_date IS NOT NULL) NOT VALID;

-- Step 4: validate separately (only takes a SHARE UPDATE EXCLUSIVE lock)
ALTER TABLE trades VALIDATE CONSTRAINT trades_settlement_not_null;

The NOT VALID trick is critical on large tables. It tells PostgreSQL to enforce the constraint on new writes immediately but skip the full table scan. The VALIDATE step runs later with a weaker lock that doesn’t block writes. (Since PostgreSQL 12, a follow-up SET NOT NULL can use the validated CHECK constraint as proof and skip its own scan, after which the CHECK constraint can be dropped.) On our 40-million-row tables, this was the difference between a 200-millisecond migration and a 20-minute outage.

Renaming a column

You don’t rename columns in production. Full stop.

What you actually do is expand-and-contract. Here is the concrete sequence we used when renaming price to unit_price on the trades table:

-- Phase 1: Expand
ALTER TABLE trades ADD COLUMN unit_price NUMERIC(18,8);

Deploy code that writes to both price and unit_price. Reads come from unit_price with a fallback to price.
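In the application, that deploy boils down to a pair of accessors. A minimal sketch (Python; `row` stands in for a fetched trades record, and the flag and function names are hypothetical):

```python
# Dual-write accessors deployed during the migration. The cutover flag is
# flipped in a later deployment, once the backfill is verified complete.
READ_FROM_NEW_ONLY = False

def save_price(row: dict, value: float) -> None:
    row["price"] = value       # keep the old column valid for rollback
    row["unit_price"] = value  # and populate the new one

def load_price(row: dict) -> float:
    if READ_FROM_NEW_ONLY:
        return row["unit_price"]
    # Mid-migration: prefer the new column, fall back for rows the
    # backfill has not reached yet.
    v = row.get("unit_price")
    return v if v is not None else row["price"]
```

Because save_price keeps price correct, rolling back to the previous release at any point leaves no broken reads.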

-- Phase 2: Backfill
UPDATE trades SET unit_price = price
WHERE unit_price IS NULL
AND id BETWEEN 1 AND 100000;

-- Repeat in batches...
UPDATE trades SET unit_price = price
WHERE unit_price IS NULL
AND id BETWEEN 100001 AND 200000;

Once backfill is complete and every application instance reads from unit_price:

-- Phase 3: Contract
ALTER TABLE trades DROP COLUMN price;

Three deployments minimum. That’s the cost. The benefit is zero downtime and a clean rollback at every step.

Changing a column type

Same pattern, different details. We had a case where an instrument identifier was stored as INTEGER but needed to become TEXT to support a new exchange format.

-- Expand
ALTER TABLE instruments ADD COLUMN external_id_new TEXT;

-- Backfill
UPDATE instruments SET external_id_new = external_id::TEXT
WHERE external_id_new IS NULL
AND id BETWEEN 1 AND 50000;

Deploy dual-write code. Cut over reads. Drop the old column.

The temptation is to use ALTER COLUMN ... TYPE TEXT, which rewrites the entire table under an ACCESS EXCLUSIVE lock. On a table with millions of rows of financial data that’s actively being queried, that isn’t an option.

Batched backfills that don’t kill the database

Large backfills are where most zero-downtime migrations go wrong. A single UPDATE ... WHERE condition on 40 million rows will generate enormous WAL, bloat the table, and hold locks that block concurrent operations.

The pattern I use:

-- Requires PostgreSQL 11+, run outside an explicit transaction
-- block so COMMIT is allowed inside the DO body.
DO $$
DECLARE
  batch_size INT := 5000;
  max_id BIGINT;
  current_id BIGINT := 0;
BEGIN
  SELECT MAX(id) INTO max_id FROM trades;

  WHILE current_id < max_id LOOP
    UPDATE trades
    SET unit_price = price
    WHERE id > current_id
      AND id <= current_id + batch_size
      AND unit_price IS NULL;

    current_id := current_id + batch_size;
    COMMIT;                 -- release locks before pausing
    PERFORM pg_sleep(0.1);  -- breathe
  END LOOP;
END $$;

The pg_sleep matters. Without it, you saturate I/O and replication lag spikes. At the fintech startup our replicas served read traffic for dashboards and analytics. A backfill that caused 30 seconds of replication lag would show stale prices to every user watching a portfolio. We settled on batches of 5,000 rows with 100ms pauses. The backfill took longer but production stayed healthy.
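When the migration tool can’t run a committing DO block (many wrap every migration in a transaction), the same loop moves to the application side. A driver-agnostic sketch, assuming a caller-supplied execute_update(lo, hi) that runs and commits one batched UPDATE and returns the rows touched (all names here are illustrative):

```python
import time

def backfill_in_batches(execute_update, max_id: int,
                        batch_size: int = 5000, pause_s: float = 0.1) -> int:
    """Walk the id space in fixed windows, pausing between committed batches.

    execute_update(lo, hi) should run something like
        UPDATE trades SET unit_price = price
        WHERE id > %(lo)s AND id <= %(hi)s AND unit_price IS NULL
    commit it, and return the number of rows updated.
    """
    total = 0
    current_id = 0
    while current_id < max_id:
        total += execute_update(current_id, current_id + batch_size)
        current_id += batch_size
        time.sleep(pause_s)  # let I/O and replication catch up
    return total
```

Keeping each batch in its own transaction caps lock hold times and WAL bursts; the pause between batches is what keeps replica lag flat.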

Indexes without locking

A plain CREATE INDEX in PostgreSQL takes a SHARE lock on the table, blocking writes for the duration of the build. On a large table, that can be minutes.

CREATE INDEX CONCURRENTLY idx_trades_settlement ON trades(settlement_date);

CONCURRENTLY builds the index without blocking writes. It takes longer and does two table scans instead of one, but it’s the only option for production tables. There are two caveats worth knowing:

First, it can’t run inside a transaction. If your migration tool wraps everything in a transaction, you need to handle this case separately.

Second, if it fails partway through, it leaves an invalid index behind. Check pg_index for entries with indisvalid = false and drop the invalid index (DROP INDEX CONCURRENTLY avoids blocking) before retrying.

SELECT indexrelid::regclass, indisvalid
FROM pg_index
WHERE NOT indisvalid;

Foreign keys without blocking

Adding a foreign key constraint with ADD CONSTRAINT ... FOREIGN KEY scans every existing row of the referencing table while holding a lock that blocks writes on both tables. Same trick as NOT NULL:

ALTER TABLE trades
  ADD CONSTRAINT fk_trades_instrument
  FOREIGN KEY (instrument_id) REFERENCES instruments(id)
  NOT VALID;

ALTER TABLE trades VALIDATE CONSTRAINT fk_trades_instrument;

The first statement adds the constraint for new writes only, taking a brief lock. The second validates existing rows with a much weaker lock.

The deployment sequence

The order matters. Getting it wrong means you have a window where the application and schema disagree.

  1. Run expand migration. New columns, new indexes (concurrently), new constraints (NOT VALID). Schema is now compatible with both old and new code.

  2. Deploy new application code. Writes to both old and new columns. Reads from new with fallback to old.

  3. Run backfill. In batches, during low traffic if possible. Monitor replication lag and query latency.

  4. Verify completeness. SELECT COUNT(*) FROM trades WHERE unit_price IS NULL should be zero.

  5. Deploy read-cutover code. Reads exclusively from new columns. Still writes to both.

  6. Run contract migration. Drop old columns, drop temporary constraints. Schema is clean.

Each step is independently reversible. If the backfill causes problems, pause it. If the new code has bugs, roll back the deployment. The old schema is still there. This is the entire point.

What this costs

More deployments. More code that handles two schemas simultaneously. Migrations that used to be one PR become three or four. Application code carries temporary dual-write logic that gets cleaned up in the contract phase.

It’s more work. But at the fintech startup, the alternative was telling financial data consumers that we needed a maintenance window during market hours. That conversation never goes well.

The discipline pays off in a different way too. When every migration follows expand-and-contract, deployments become boring. Nobody pages you at 2 AM for a schema change. Nobody holds their breath during a release. The process is mechanical and predictable.

The best migration is the one nobody notices. Boring by design, invisible in production.