Database Reliability Engineering: What I've Learned the Hard Way

7 min read
Tags: databases, reliability, sre, postgresql

Practical database reliability lessons from running Postgres at a fintech startup and at large enterprises: config examples, migration patterns, and the operational habits that actually prevent outages.

Quick take

Define your RPO/RTO before you pick a replication strategy. Test your restores monthly – most backup failures are actually restore failures. Use safe migration patterns (nullable columns, concurrent indexes, lock timeouts). Monitor replication lag, slow queries, and disk growth. Practice failovers before you need them. This post includes the Postgres configs and SQL patterns I use.


I’ve been responsible for databases at a fintech startup (a financial data platform on Postgres), at Decloud, and at several large enterprises. The lesson that took me longest to internalize: database reliability isn’t a technology problem. It’s a habits problem.

You can run the best hardware, the fanciest replication setup, the most expensive managed service. If you don’t practice restores, don’t test migrations against production data volumes, and don’t have runbooks for failover, you’re going to have a bad day. The technology just determines how bad.

Start with numbers, not architecture

Before touching any config, agree on three numbers with your team and your business stakeholders:

RPO (Recovery Point Objective): How much data can you lose? For the fintech startup’s financial data, the answer was “essentially zero” – we couldn’t afford to lose transactions. For an analytics pipeline, losing 5 minutes of data might be fine.

RTO (Recovery Time Objective): How long can the database be down? A consumer app might tolerate 30 seconds. A payment system might need sub-second failover.

Latency SLOs: What’s the p99 query latency you’re committing to? This drives capacity planning, connection pooling, and index strategy.

These numbers dictate everything downstream. Synchronous replication for near-zero RPO. Automated failover for aggressive RTO. Proper connection pooling and query optimization for latency SLOs. Don’t skip this conversation.

Replication: pick your tradeoff

At the fintech startup we ran synchronous replication for our primary financial data. The write latency penalty was real – every commit waited for at least one standby to acknowledge. But for a financial data platform, losing committed transactions wasn’t an option.

For our analytics and reporting databases, we used asynchronous replication. Faster writes, slight risk of data loss on primary failure. Acceptable for data that could be re-derived.

In Postgres, the config difference is small but the operational impact is significant:

# Synchronous replication - zero data loss, higher write latency
synchronous_commit = on
synchronous_standby_names = 'FIRST 1 (standby1, standby2)'

# Asynchronous replication - lower latency, possible data loss on failover
synchronous_commit = off

The mistake I keep seeing: running async replication without understanding the RPO implication. If your primary dies and the standby is 5 seconds behind, you just lost 5 seconds of data. If that includes a payment confirmation, you have a problem that no incident postmortem can fix.
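A quick way to see your exposure is to check standby lag from the primary. This query uses the built-in pg_stat_replication view (the time-based replay_lag column needs Postgres 10+):

```sql
-- Per-standby replication lag, run on the primary (Postgres 10+)
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
       replay_lag  -- how far behind the standby's replay is, as an interval
FROM pg_stat_replication;
```

If replay_lag regularly exceeds your RPO, async replication is quietly making a promise your business hasn't agreed to.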

Failover: automate with guardrails

Automated failover sounds great until it promotes a standby that’s behind the primary and you get split-brain. Or it triggers during a network partition and now you have two primaries accepting writes.

At a large telecom client, I helped set up Patroni for Postgres HA. The key decisions:

  • Health checks from multiple locations (not just one monitoring server)
  • Minimum replication lag threshold before promotion is allowed
  • Fencing of the old primary before the new one accepts writes
  • Manual fallback documented and practiced

# Patroni failover constraints - these keys sit at the top level of the
# cluster (DCS) configuration, e.g. under bootstrap.dcs
ttl: 30
retry_timeout: 10
maximum_lag_on_failover: 1048576  # ~1MB of WAL

The maximum_lag_on_failover setting is critical. It prevents promotion of a standby that’s too far behind. I’ve seen teams leave this at the default and then wonder why they lost data during failover.

Test your failover. Quarterly at minimum. The first time you practice it, something will go wrong. Better during a planned exercise than during an actual outage at 3am.
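For a bare-Postgres drill without Patroni, a standby can be promoted from SQL on Postgres 12+ (under Patroni, use patronictl switchover instead so the cluster state stays consistent):

```sql
-- Run on the standby you want to promote (Postgres 12+).
-- Waits for promotion to complete (up to 60s by default) and returns true.
SELECT pg_promote();
```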

Backups: the restore is the backup

I can’t emphasize this enough. Your backup strategy is only as good as your last tested restore.

At the fintech startup, we ran:

  • Daily base backups via pg_basebackup
  • Continuous WAL archiving to S3 for point-in-time recovery
  • Weekly automated restore tests into a staging environment
  • Monthly manual restore drills where an engineer actually performed the full recovery

The weekly automated test caught a corrupted backup exactly once. Once was enough. Without that test, we would have discovered the corruption during an actual emergency.

A solid Postgres backup setup:

# postgresql.conf - enable WAL archiving
archive_mode = on
archive_command = 'aws s3 cp %p s3://backup-bucket/wal/%f'
wal_level = replica

Point-in-time recovery is the real value here. Full backups let you restore to the last backup. WAL archiving lets you restore to any point in time. When someone runs a bad DELETE at 2:47pm, you can recover to 2:46pm. That capability has saved me more than once.
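As a sketch of what that recovery looks like on Postgres 12+ (the bucket path matches the archive_command above; the timestamp is a hypothetical target just before the bad DELETE), you set the recovery target in postgresql.conf and create an empty recovery.signal file in the data directory:

```
# Restore WAL from the archive, replay to just before the bad DELETE
restore_command = 'aws s3 cp s3://backup-bucket/wal/%f %p'
recovery_target_time = '2024-05-01 14:46:00'
recovery_target_action = 'promote'
```

On startup, Postgres replays archived WAL up to the target time and then promotes itself to accept writes.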

Safe migrations: the number one reliability risk

Schema changes cause more database incidents than hardware failures. I’ve seen it at every company. Someone runs an ALTER TABLE that locks a table for 20 minutes during peak traffic. The connection pool fills up. Upstream services start timing out. Cascade failure.

The patterns that keep you safe:

Add columns as nullable. Never add a NOT NULL column without a default to an existing table in a single step. Add it nullable, backfill, then add the constraint.

-- Step 1: Add nullable column
ALTER TABLE orders ADD COLUMN status_v2 text;

-- Step 2: Backfill in batches
UPDATE orders SET status_v2 = status
WHERE id BETWEEN 1 AND 10000 AND status_v2 IS NULL;
-- repeat for remaining batches

-- Step 3: Add constraint after data is clean
ALTER TABLE orders ALTER COLUMN status_v2 SET NOT NULL;
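One caveat: SET NOT NULL scans the whole table under an ACCESS EXCLUSIVE lock. On very large tables, a gentler variant (Postgres 12+) proves the constraint first with a NOT VALID check, so the exclusive lock is only held for an instant:

```sql
-- Instant: constraint is recorded but existing rows aren't checked yet
ALTER TABLE orders ADD CONSTRAINT orders_status_v2_not_null
    CHECK (status_v2 IS NOT NULL) NOT VALID;

-- Scans the table, but under a weaker SHARE UPDATE EXCLUSIVE lock
ALTER TABLE orders VALIDATE CONSTRAINT orders_status_v2_not_null;

-- Now fast: Postgres 12+ uses the validated constraint as proof
ALTER TABLE orders ALTER COLUMN status_v2 SET NOT NULL;
ALTER TABLE orders DROP CONSTRAINT orders_status_v2_not_null;
```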

Build indexes concurrently. A regular CREATE INDEX locks the table for writes. On a large table, that’s an outage.

-- This blocks writes:
CREATE INDEX idx_orders_status ON orders(status);

-- This doesn't:
CREATE INDEX CONCURRENTLY idx_orders_status ON orders(status);
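One gotcha: if a CONCURRENTLY build fails partway (deadlock, duplicate key), it leaves an INVALID index behind that still slows down writes. Worth checking after every concurrent build:

```sql
-- Find leftover invalid indexes from failed concurrent builds
SELECT indexrelid::regclass AS index_name
FROM pg_index
WHERE NOT indisvalid;
-- Drop each with DROP INDEX CONCURRENTLY, then retry the CREATE
```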

Set lock and statement timeouts. Prevent runaway migrations from holding locks indefinitely.

-- Before running migration
SET lock_timeout = '5s';
SET statement_timeout = '30s';

If a migration can’t acquire its lock within 5 seconds, it fails instead of queuing behind long-running queries and creating a pileup.

Test against production volumes. A migration that runs in 2 seconds on your dev database with 1,000 rows might take 45 minutes on production with 50 million rows. I learned this lesson at the fintech startup when a “quick” index build blocked writes for 12 minutes because we hadn’t tested against actual data volume.
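A cheap way to rehearse at realistic scale is to generate synthetic rows with generate_series (the table name and row count here are illustrative; match your production numbers):

```sql
-- Synthetic 50M-row table for timing the migration before production
CREATE TABLE orders_rehearsal AS
SELECT g AS id,
       'pending'::text AS status
FROM generate_series(1, 50000000) AS g;
-- Now time the actual ALTER TABLE / CREATE INDEX statements against it
```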

Monitoring: watch the right things

Most database monitoring setups track too many things and miss the ones that matter. Here’s what I actually alert on:

Alert immediately:

  • Replication lag above threshold (for us, 10 seconds)
  • Disk usage above 80%
  • Connection pool saturation above 90%
  • Failed backup or missed WAL archive
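The connection saturation check is a single query (run it as a role with pg_monitor so no backends are hidden from pg_stat_activity):

```sql
-- Percentage of max_connections currently in use
SELECT count(*) AS connections,
       current_setting('max_connections')::int AS max_conn,
       round(100.0 * count(*)
             / current_setting('max_connections')::int, 1) AS pct_used
FROM pg_stat_activity;
```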

Alert on sustained trends:

  • Slow query count increasing over 15 minutes
  • Lock wait time increasing over 5 minutes
  • Transaction wraparound approaching (Postgres-specific, but it will bite you)

Track in dashboards, don’t alert:

  • Query latency distribution (p50, p95, p99)
  • Table and index sizes over time
  • Autovacuum activity
  • Cache hit ratios
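For the slow-query trend, pg_stat_statements is the standard tool. Column names below are the Postgres 13+ ones; older versions call them total_time and mean_time:

```sql
-- Top 10 queries by cumulative execution time
SELECT round(total_exec_time::numeric, 1) AS total_ms,
       calls,
       round(mean_exec_time::numeric, 2) AS mean_ms,
       left(query, 80) AS query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```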

A Postgres-specific query I run weekly to find trouble:

-- Tables approaching transaction wraparound
SELECT relname, age(relfrozenxid) as xid_age,
       pg_size_pretty(pg_total_relation_size(oid)) as total_size
FROM pg_class
WHERE relkind = 'r' AND age(relfrozenxid) > 200000000
ORDER BY age(relfrozenxid) DESC;

Transaction wraparound is the kind of problem that gives you no warning and then takes the database down hard. Monitor it.

Maintenance isn’t optional

Postgres needs vacuuming. It needs statistics updates. It needs index maintenance. These aren’t nice-to-haves.

Autovacuum handles most of the work, but the defaults are conservative. For busy tables:

-- Per-table override for high-write tables
ALTER TABLE events SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_analyze_scale_factor = 0.005
);

The default autovacuum_vacuum_scale_factor of 0.2 means autovacuum waits until roughly 20% of a table's rows are dead (plus the small fixed autovacuum_vacuum_threshold) before vacuuming. On a 100-million-row table, that's 20 million dead tuples before vacuum kicks in. Dropping it to 0.01 keeps things far cleaner.

Index bloat is the other silent killer. Indexes grow as data is updated and deleted. Periodically check pg_stat_user_indexes for indexes with low usage and high size. Rebuild bloated indexes during maintenance windows using REINDEX CONCURRENTLY (Postgres 12+).
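A starting point for that review (idx_scan counters reset with pg_stat_reset, so make sure they've had time to accumulate before trusting a zero):

```sql
-- Never-scanned, non-unique indexes, largest first
SELECT s.relname AS table_name,
       s.indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(s.indexrelid)) AS index_size
FROM pg_stat_user_indexes s
JOIN pg_index i ON i.indexrelid = s.indexrelid
WHERE s.idx_scan = 0
  AND NOT i.indisunique
ORDER BY pg_relation_size(s.indexrelid) DESC;
```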

The operational habits that matter

Technology is the easy part. Habits are what prevent outages:

  • Monthly restore drills. Someone on the team restores a backup from scratch. Time it. Document the steps. Every time.
  • Quarterly failover tests. Actually promote a standby. Verify data integrity. Fail back. Fix whatever surprised you.
  • Runbooks for every alert. If an alert fires and the on-call engineer has to figure out what to do from scratch, the runbook is missing.
  • Post-incident follow-through. Writing the postmortem is easy. Actually fixing the root cause is what most teams skip.

Database reliability isn’t something you build once. It’s something you practice until the team can handle a 3am failover without panicking. The configs and queries above are tools. The discipline to use them consistently is the actual hard part.