Quick take
The short version: state scales cheapest when you get more out of one machine before you distribute. Connection pooling is free headroom, replication is the first major lever, and partitioning buys time before sharding complexity. I’ve been through this progression at the fintech startup and the order matters more than any single technique.
The context
At the fintech startup we ingest financial news from thousands of sources. Earnings calls, filings, analyst reports, social signals – all of it lands in PostgreSQL. When I joined as CTO, the database was a single server doing everything: writes from the ingestion pipeline, reads from the API, analytics queries from internal tools. It worked. Then the data volume doubled in three months and it stopped working.
This post is what I learned fixing it. Not theory. The actual progression we followed, the configs we used, and the tradeoffs we hit.
Find the bottleneck first
Everything looks the same when the database is slow. Queries pile up, latency spikes, the app feels sluggish. But the fix depends entirely on where the pressure is coming from.
Read pressure means CPU is saturated serving SELECT queries. Write pressure means WAL generation and fsync are the bottleneck. Connection pressure means the process-per-connection model is eating memory. Storage pressure means the tables are too big for efficient vacuuming and indexing.
At the fintech startup, we had all four. But they didn’t all matter equally. Connection pressure was killing us first because the ingestion workers each held their own connection. We had maybe 300 workers and the server was spending more time context-switching between backends than doing actual work.
Connection pooling with PgBouncer
This was the first fix and it was nearly free. PgBouncer sits between the application and PostgreSQL, multiplexes hundreds of client connections over a small pool of real database connections, and eliminates the fork-per-connection overhead.
Here is something close to what we ran:
[databases]
fintech_db = host=127.0.0.1 port=5433 dbname=fintech_db
[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 5432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 2000
default_pool_size = 30
reserve_pool_size = 5
reserve_pool_timeout = 3
server_idle_timeout = 300
server_lifetime = 3600
log_connections = 0
log_disconnections = 0
A few things worth noting. pool_mode = transaction is the right default for almost every workload. It returns the server connection to the pool after each transaction completes, so 2000 application connections share 30 real database connections. Session mode keeps the connection pinned for the entire client session, which defeats the purpose. Statement mode is the most aggressive but breaks anything that uses multi-statement transactions or prepared statements.
reserve_pool_size gives you a small buffer for traffic spikes. If all 30 connections are busy, PgBouncer will open up to 5 more for a short window before rejecting clients. This saved us during batch ingestion runs where write volume would spike for a few minutes.
The result: our 300 ingestion workers plus API servers plus internal tools all shared 30 actual PostgreSQL backends. Memory usage on the database server dropped by 40%. Query latency improved because the server wasn’t spending cycles managing hundreds of idle connections.
Streaming replication
Pooling bought us time. The next bottleneck was read traffic. The API served search results, financial summaries, and news feeds – all read-heavy queries competing with the write path on a single server.
PostgreSQL streaming replication is straightforward. The primary streams WAL (Write-Ahead Log) segments to one or more standbys, which replay them and serve read traffic. Here is the relevant part of the primary’s configuration:
# postgresql.conf on the primary
wal_level = replica
max_wal_senders = 5
wal_keep_segments = 64
# For monitoring replication lag
track_commit_timestamp = on
And the standby’s recovery.conf:
standby_mode = on
primary_conninfo = 'host=primary.internal port=5432 user=replicator password=xxx'
trigger_file = '/tmp/postgresql.trigger'
We set up two read replicas behind a simple connection routing layer in Go. Write queries went to the primary. Read queries went to the replicas using round-robin. Nothing fancy.
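That routing layer boiled down to very little code. Here is a Go sketch of the idea – the hostnames and the naive isRead heuristic are placeholders, not our production code:

```go
package main

import (
	"fmt"
	"strings"
	"sync/atomic"
)

// Router sends writes to the primary and round-robins reads across replicas.
type Router struct {
	primary  string
	replicas []string
	next     uint64 // round-robin counter, advanced atomically
}

// Route returns the server that should handle the given SQL statement.
func (r *Router) Route(query string) string {
	if !isRead(query) || len(r.replicas) == 0 {
		return r.primary
	}
	n := atomic.AddUint64(&r.next, 1)
	return r.replicas[(n-1)%uint64(len(r.replicas))]
}

// isRead is deliberately naive: anything starting with SELECT goes to a
// replica, everything else goes to the primary.
func isRead(query string) bool {
	return strings.HasPrefix(strings.ToUpper(strings.TrimSpace(query)), "SELECT")
}

func main() {
	r := &Router{
		primary:  "primary.internal:5432",
		replicas: []string{"replica1.internal:5432", "replica2.internal:5432"},
	}
	fmt.Println(r.Route("SELECT * FROM financial_events LIMIT 10")) // first replica
	fmt.Println(r.Route("INSERT INTO financial_events ..."))        // primary
	fmt.Println(r.Route("SELECT 1"))                                // second replica
}
```

A real version needs a smarter read classifier (CTEs, SELECT ... FOR UPDATE, and function calls with side effects all break the prefix check), but the round-robin core really is this small.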
The thing people underestimate about replication is lag. A user writes a comment, the API returns success, the user refreshes and the comment isn’t there because the replica hasn’t caught up. At the fintech startup this mattered less because our data was financial news – a few hundred milliseconds of lag on search results is invisible. But for any read-after-write path, you need to either route reads back to the primary for a short window or track replication position and only read from replicas that have caught up.
We monitored lag with a simple query on the replicas:
SELECT
now() - pg_last_xact_replay_timestamp() AS replication_lag;
In steady state we saw 50-200ms of lag. During batch ingestion spikes it would climb to 1-2 seconds. We set alerts at 5 seconds. If lag hits 10 seconds, something is wrong with the replica’s I/O or the WAL shipping is backing up.
Partitioning for high-volume tables
After pooling and replicas, the next problem was table size. Our main events table held every financial event we had ever ingested. Hundreds of millions of rows. VACUUM took hours. Index rebuilds blocked writes. Queries that should have been fast were scanning enormous B-trees.
In 2016, PostgreSQL doesn’t have native declarative partitioning. That’s coming in version 10. What we have instead is table inheritance with CHECK constraints and manual routing. It’s ugly. It works.
-- Parent table
CREATE TABLE financial_events (
id BIGSERIAL,
event_time TIMESTAMP NOT NULL,
source_id INTEGER NOT NULL,
event_type VARCHAR(50) NOT NULL,
payload JSONB NOT NULL
);
-- Monthly partitions
CREATE TABLE financial_events_2016_10 (
CHECK (event_time >= '2016-10-01' AND event_time < '2016-11-01')
) INHERITS (financial_events);
CREATE TABLE financial_events_2016_11 (
CHECK (event_time >= '2016-11-01' AND event_time < '2016-12-01')
) INHERITS (financial_events);
-- Indexes on each partition, not the parent
CREATE INDEX idx_fe_2016_10_time ON financial_events_2016_10 (event_time);
CREATE INDEX idx_fe_2016_10_source ON financial_events_2016_10 (source_id);
CREATE INDEX idx_fe_2016_11_time ON financial_events_2016_11 (event_time);
CREATE INDEX idx_fe_2016_11_source ON financial_events_2016_11 (source_id);
The insert routing was a trigger function:
CREATE OR REPLACE FUNCTION financial_events_insert_trigger()
RETURNS TRIGGER AS $$
BEGIN
IF NEW.event_time >= '2016-11-01' AND NEW.event_time < '2016-12-01' THEN
INSERT INTO financial_events_2016_11 VALUES (NEW.*);
ELSIF NEW.event_time >= '2016-10-01' AND NEW.event_time < '2016-11-01' THEN
INSERT INTO financial_events_2016_10 VALUES (NEW.*);
-- ... older months
ELSE
RAISE EXCEPTION 'No partition for event_time %', NEW.event_time;
END IF;
RETURN NULL;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER insert_financial_events
BEFORE INSERT ON financial_events
FOR EACH ROW EXECUTE PROCEDURE financial_events_insert_trigger();
Yes, you have to maintain this trigger as you create new partitions. We wrote a cron job in Go that created next month’s partition and updated the trigger on the first of every month. Not glamorous. Reliable.
The wins were immediate. VACUUM on a monthly partition with 20 million rows takes minutes, not hours. Queries with WHERE event_time BETWEEN ... hit only the relevant partitions because constraint exclusion prunes the rest. And when data aged past our retention window, we dropped entire partitions instead of running massive DELETE queries – DROP TABLE financial_events_2015_06 is instant and generates zero dead tuples.
Sharding: the last resort
We didn’t shard at the fintech startup. I want to be honest about that. We got to the point where it was on the table, but the combination of pooling, two replicas, and monthly partitioning handled our volume. We were ingesting roughly 50,000 events per minute at peak and PostgreSQL on decent hardware with these optimizations kept up.
But I’ve seen sharding done at a previous company, and my strong opinion is: don’t do it unless you have exhausted everything above and a single primary genuinely can’t handle the write throughput.
Sharding means splitting data across independent PostgreSQL instances. You pick a shard key – usually a tenant ID or a hash of some natural key – and route writes and reads to the correct shard. The application or a middleware layer owns the routing.
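In its simplest form the routing is a hash and a modulus. A Go sketch with placeholder shard addresses:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

var shards = []string{
	"shard0.internal:5432",
	"shard1.internal:5432",
	"shard2.internal:5432",
	"shard3.internal:5432",
}

// shardFor maps a tenant ID (the shard key) to a shard. Note that adding a
// shard changes the modulus and remaps most keys, which is exactly why
// rebalancing is so painful.
func shardFor(tenantID string) string {
	h := fnv.New32a()
	h.Write([]byte(tenantID))
	return shards[h.Sum32()%uint32(len(shards))]
}

func main() {
	for _, t := range []string{"acme", "globex", "initech"} {
		fmt.Printf("%s -> %s\n", t, shardFor(t))
	}
}
```

Consistent hashing softens the remapping problem when you add shards, but it doesn’t eliminate data movement – some keys still have to migrate between live databases.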
The hidden costs are brutal. Schema migrations have to be applied to every shard. Cross-shard queries become application-level aggregation. Rebalancing shards when data distribution skews means moving live data between databases. Transactions that span shards are either impossible or require two-phase commit, which is slow and fragile. Every operational runbook gets multiplied by the number of shards.
If you’re at the point where sharding is necessary, you have a big enough team and budget to handle the operational complexity. If you don’t have that team, sharding will hurt more than the performance problem it solves.
The progression matters
The order isn’t arbitrary.
- Pooling is nearly free and should be in place from day one. There’s no reason to let PostgreSQL manage hundreds of connections directly.
- Read replicas are the first real scaling lever. They offload the most common pressure (reads) with minimal application changes.
- Partitioning makes large tables manageable and solves vacuum, indexing, and retention problems that replicas don’t help with.
- Sharding is the nuclear option. Powerful. Expensive. Irreversible in practice.
Each step is worth a significant amount of effort to delay the next one. We spent a week tuning PgBouncer and it delayed the need for replicas by two months. We spent two weeks setting up replication and it delayed the partitioning work by four months. Partitioning has delayed any sharding conversation indefinitely.
What I would do differently
I would set up PgBouncer before the first production deploy, not after the first connection storm. I would build the partitioning infrastructure from the start for any table expected to grow past 50 million rows. And I would invest more in monitoring replication lag early – we flew blind for the first few weeks after setting up replicas and got lucky that lag never caused a visible bug.
Scaling PostgreSQL is a sequence, not a leap. Each step buys you months. Skip the sequence and you pay for every shortcut at once.