Stop Guessing: How I Fix Slow Databases


The repeatable process I use at the fintech startup to diagnose and fix database performance problems instead of throwing random indexes at the wall.

Last month one of our price history endpoints at the fintech startup ground to a halt. Eight seconds per request. We’re talking about a query that pulls stock price records for a given ticker over a date range. Should be trivial. Our first instinct was the usual panic — someone slapped a new index on the table, someone else bumped shared_buffers, and a third person started rewriting the query. None of it helped. Because nobody had actually looked at what was slow or why.

I’ve seen this pattern enough times now that I refuse to participate in it. Here’s how I actually approach database performance problems.

“The database is slow” isn’t a problem statement

That sentence tells me nothing. Which query? Which endpoint? What’s the acceptable latency? When did it start? What changed?

At the fintech startup, when someone says the API is slow, I make them fill in the blanks: “The /api/prices/{ticker} endpoint exceeds 2 seconds during market hours after we added the sentiment join last Tuesday.” Now I have a target. A timeline. Something I can actually investigate.

Without that specificity you’re debugging vibes.

Measure first. Always.

Before touching anything, turn on slow query logging and get a baseline. I run PostgreSQL for everything I can, so:

ALTER SYSTEM SET log_min_duration_statement = '50ms';
SELECT pg_reload_conf();

I set the threshold low — 50ms — because I want to see the full picture, not just the obvious disasters. You’d be surprised how many “fast” queries add up when they run thousands of times per minute.

The slow query log tells you what’s actually slow. Not what your gut says is slow. Not what the intern thinks is slow. What the database measured as slow.
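If you'd rather see aggregates than individual log lines, the pg_stat_statements extension (assuming you have it installed and preloaded) gives you a ranked view. Something like this, on PostgreSQL 13 or newer where the columns are named mean_exec_time and total_exec_time:

```sql
-- Requires: CREATE EXTENSION pg_stat_statements, plus
-- shared_preload_libraries = 'pg_stat_statements' and a restart.
-- On PG 12 and older the columns are mean_time / total_time instead.
SELECT calls,
       round(mean_exec_time::numeric, 1)  AS mean_ms,
       round(total_exec_time::numeric, 1) AS total_ms,
       left(query, 80)                    AS query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```

Sorting by total time rather than mean time is deliberate: a 60ms query running ten thousand times a minute matters more than a 3-second query running once an hour.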

EXPLAIN ANALYZE is your best friend

I mean this literally. If you’re doing database work and you don’t use EXPLAIN (ANALYZE, BUFFERS) regularly, you’re flying blind.

Here’s what I ran for our price history problem:

EXPLAIN (ANALYZE, BUFFERS)
SELECT p.ticker, p.price_date, p.open, p.close, p.volume
FROM price_history p
WHERE p.ticker = 'AAPL'
  AND p.price_date BETWEEN '2017-01-01' AND '2017-08-01'
ORDER BY p.price_date;

The output told me everything. The planner estimated 200 rows. Reality: 147,000. When estimates are that far off, PostgreSQL picks terrible plans. It chose a nested loop where a hash join would’ve been orders of magnitude faster. The statistics on price_history were stale — we’d bulk-loaded six months of historical data and never ran ANALYZE.
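The mismatch is visible right in the plan node. Illustrative output showing the shape of it, not our verbatim numbers:

```
Nested Loop  (cost=0.42..1510.64 rows=200 width=34)
             (actual time=0.318..7894.237 rows=147000 loops=1)
```

The rows=200 in the cost estimate versus rows=147000 in the actual is the whole story. Once you learn to read that gap, you can spot stale statistics in seconds.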

ANALYZE price_history;

One command. Query went from 8 seconds to 90 milliseconds. No new index. No config change. Just accurate statistics.

That’s the thing about database tuning. The fix is often embarrassingly simple once you know what’s wrong.

Finding the actual bottleneck

After you read the plan, you’re looking for one of a few usual suspects:

Sequential scans on large tables. If PostgreSQL is scanning 50 million rows to find 12, you need an index. But only for the columns your queries actually filter on.

Joins that explode row counts. We had a query joining price_history to news_sentiment that produced a Cartesian product on certain tickers. The join condition was missing a date constraint. The query went from returning 500 rows to materializing 3 million intermediate rows.

Lock contention. Bulk inserts fighting with read queries. We see this during our nightly data loads. pg_stat_activity tells you who’s waiting on whom.
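A query I keep handy for exactly that, using pg_blocking_pids() (available from PostgreSQL 9.6 onward):

```sql
-- Who is waiting, and which backend is blocking them
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       state,
       left(query, 80)       AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
```

The blocked_by column gives you the PIDs holding the locks; look those up in pg_stat_activity to see what they're running.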

Memory pressure. If your working set doesn’t fit in shared_buffers, every query hits disk. Check your buffer cache hit ratio:

SELECT
  sum(heap_blks_hit) / nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0) AS cache_hit_ratio
FROM pg_statio_user_tables;

Anything below 0.99 on a read-heavy workload means you’re probably leaving performance on the table. Literally.

Pick the biggest contributor. Fix that one thing. Measure again.

Indexing: powerful but not free

I love indexes. I also hate bad indexes. Every index you add slows down writes and consumes disk. “Index everything” is a strategy born from laziness.

At the fintech startup our price_history table gets thousands of inserts per minute during market hours. We can’t afford garbage indexes. So I index what the queries need, and nothing else.

CREATE INDEX idx_price_history_ticker_date
ON price_history (ticker, price_date DESC);

Column order matters. This index is perfect for “give me AAPL prices sorted by date.” Reverse the columns and the planner might not use it at all.

PostgreSQL partial indexes are incredibly useful for skewed data:

CREATE INDEX idx_active_alerts
ON price_alerts (ticker, created_at)
WHERE active = true;

Only 5% of alerts are active at any given time. This index is tiny compared to indexing the full table, and it covers the only query pattern we care about for active alerts.

Query shape matters more than you think

Stop writing SELECT *. I see this constantly. You need three columns but you’re transferring 40 because the ORM defaulted to SELECT *. On a table with a JSONB column holding 2KB of metadata per row, that’s the difference between a 50ms query and a 500ms query.
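The fix is boring: name the columns. A sketch against our price table (the LIMIT and sort are illustrative):

```sql
-- Three columns instead of the full row and its JSONB payload
SELECT ticker, price_date, close
FROM price_history
WHERE ticker = 'AAPL'
ORDER BY price_date DESC
LIMIT 30;
```

As a bonus, a narrow column list makes index-only scans possible; SELECT * almost never qualifies for one.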

Rewrite subqueries as joins when it makes sense. PostgreSQL’s planner is good but it’s not magic. I had a correlated subquery that checked whether each ticker had recent sentiment data:

-- Slow: correlated subquery, runs once per row
SELECT ticker, close
FROM price_history
WHERE price_date = '2017-08-01'
  AND EXISTS (
    SELECT 1 FROM news_sentiment
    WHERE news_sentiment.ticker = price_history.ticker
      AND sentiment_date > '2017-07-01'
  );

-- Fast: rewritten as join
SELECT DISTINCT p.ticker, p.close
FROM price_history p
JOIN news_sentiment s ON s.ticker = p.ticker
WHERE p.price_date = '2017-08-01'
  AND s.sentiment_date > '2017-07-01';

The join version let PostgreSQL hash the sentiment table once instead of probing it for every row. 40x improvement.

Connection management

This one bites people who don’t expect it. PostgreSQL forks a process per connection. 200 connections means 200 processes fighting over CPU and memory. I’ve seen a server with 16 cores running 500 connections, where cutting to 50 connections through PgBouncer doubled throughput.

Your connection pool size should match your database’s capacity, not your application’s thread count. For most workloads, (2 * CPU cores) + number of disks is a reasonable starting point. We run 24 connections at the fintech startup for a 12-core database server and it handles our load fine.
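A minimal PgBouncer sketch along those lines. The host, database name, and client limit here are hypothetical, and transaction pooling has a real caveat: session-level state (SET, session prepared statements, advisory locks) doesn't survive across transactions.

```ini
; Hypothetical pgbouncer.ini fragment for a 12-core database server
[databases]
prices = host=127.0.0.1 port=5432 dbname=prices

[pgbouncer]
pool_mode = transaction     ; caveat: breaks session-level state
default_pool_size = 24      ; (2 * 12 cores), per the heuristic above
max_client_conn = 500       ; app connections multiplexed onto 24 server ones
```

The point of the multiplexing: the application can open hundreds of cheap client connections while the database only ever sees 24.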

Configuration: change one thing at a time

PostgreSQL ships with conservative defaults because it has to run on everything from a Raspberry Pi to a 256GB server. A few settings worth adjusting:

  • shared_buffers: 25% of RAM is the standard advice. It’s decent advice.
  • effective_cache_size: 75% of RAM. Tells the planner how much cache to expect from the OS.
  • work_mem: Be careful here. This is per-operation, not per-query. A complex query with five sorts uses 5 * work_mem. I’ve seen people set this to 1GB and then wonder why they’re OOM.
  • maintenance_work_mem: Crank this up for VACUUM and CREATE INDEX operations. 512MB or 1GB is reasonable.
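As concrete SQL, for an illustrative 64GB dedicated server. The numbers are examples of the ratios above, not a recommendation for your hardware:

```sql
ALTER SYSTEM SET shared_buffers = '16GB';        -- ~25% of RAM; requires a restart
ALTER SYSTEM SET effective_cache_size = '48GB';  -- ~75% of RAM; a planner hint, allocates nothing
ALTER SYSTEM SET work_mem = '64MB';              -- per sort/hash operation, not per query
ALTER SYSTEM SET maintenance_work_mem = '1GB';   -- for VACUUM and CREATE INDEX
SELECT pg_reload_conf();                         -- picks up everything except shared_buffers
```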

Change one setting. Measure. If it helps, keep it. If it doesn’t, revert. This isn’t the place for creativity.

Don’t ignore the ORM

If you’re using Django, Rails, or any other ORM, you need to know what SQL it generates. ORMs are convenient. They’re also capable of producing spectacularly bad queries when you’re not paying attention. N+1 queries are the classic, but I’ve seen ORMs generate queries with unnecessary subselects, missing join conditions, and implicit type casts that defeat indexes.

Log the SQL. Read it. If it’s bad, drop to raw SQL for that query. The ORM won’t be offended.
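You can spot the N+1 pattern in the log without reading a line of application code. A hypothetical example of its shape (the watchlists table here is made up for illustration):

```sql
-- One query for the list...
SELECT id, ticker FROM watchlists WHERE user_id = 42;
-- ...then one query per row, N times over:
SELECT * FROM price_history WHERE ticker = 'AAPL' ORDER BY price_date DESC LIMIT 1;
SELECT * FROM price_history WHERE ticker = 'MSFT' ORDER BY price_date DESC LIMIT 1;
-- ...and so on for every ticker in the watchlist.
```

Each individual query is fast, which is exactly why a low log_min_duration_statement threshold matters: the damage is in the count, not the per-query time.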

The loop

That’s really all there is to it. Define the problem precisely. Measure with real data. Read the execution plan. Fix the biggest bottleneck. Measure again. Repeat.

It’s not glamorous. There’s no single magic setting that makes everything fast. But this process works every time, and it works because you’re making decisions based on data instead of folklore.

Every performance win I’ve had at the fintech startup came from this loop. The boring, repeatable loop.