At Dropbyke we had real-time GPS tracking for thousands of bikes across Seoul. Before a big launch push, we ran a load test that said we could handle 10x our current traffic. Two days after launch, the system buckled at 3x. The test was wrong because we’d tested flat HTTP requests against a clean database. Real traffic had WebSocket connections holding state, bursty GPS updates, and a database full of six months of location history.
That failure taught me something I keep relearning: the test is only as good as the model.
What actually matters
Forget throughput vanity metrics. A useful load test answers specific questions:
- Which component breaks first? (Spoiler: it’s almost never what you expect.)
- What does degradation look like? Errors? Timeouts? Silent data loss?
- How long does recovery take after the spike passes?
If your test doesn’t answer those, you’re just generating graphs.
The tests worth running
Soak test. Run moderate load for 8-12 hours. This is where Go’s garbage collector surprises you, where connection pools quietly leak, where that in-memory cache grows without bounds. I’ve caught more production bugs with soak tests than with any other kind.
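Telling a real leak apart from normal GC churn is mostly a trend question. Here’s a minimal sketch of one way to flag it from periodic memory samples; the quarter-vs-quarter comparison, the 5% tolerance, and the numbers are illustrative choices, not a standard:

```python
# Sketch: decide whether periodic memory samples from a soak run
# look like a leak. Heuristic and tolerance are illustrative.

def leak_suspected(samples_mb, tolerance=0.05):
    """True if memory is still climbing at the end of the run.

    Compares the average of the last quarter of samples to the
    first quarter: a healthy service plateaus after warm-up, a
    leaking one keeps growing past the tolerance margin.
    """
    q = max(1, len(samples_mb) // 4)
    early = sum(samples_mb[:q]) / q
    late = sum(samples_mb[-q:]) / q
    return late > early * (1 + tolerance)

# A service that plateaus vs. one retaining ~5 MB per sample:
print(leak_suspected([400, 410, 415, 412, 414, 413, 415, 414]))  # → False
print(leak_suspected([400, 405, 410, 415, 420, 425, 430, 435]))  # → True
```

The point isn’t the exact heuristic; it’s that soak-test memory data should be checked by a script, not eyeballed on a dashboard at hour eleven.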
Spike test. Jump from normal to 5x and back. At the fintech startup, our API handled steady financial data queries fine but fell over during earnings-season spikes because the autoscaler couldn’t keep up. Knowing your recovery curve matters more than knowing your peak.
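You can put a number on that recovery curve if your test exports a per-second error-rate timeline. A sketch, where the 1% threshold and the timeline itself are invented for illustration:

```python
# Sketch: measure how long the system took to recover after a
# spike ended, given per-second error rates from the test run.

def recovery_seconds(error_rates, spike_end, threshold=0.01):
    """Seconds after spike_end until the error rate stays below threshold."""
    for t in range(spike_end, len(error_rates)):
        if all(r < threshold for r in error_rates[t:]):
            return t - spike_end
    return None  # never recovered within the observed window

# Error rate per second: clean, spike (seconds 5-9), slow recovery.
timeline = [0.0] * 5 + [0.20, 0.35, 0.30, 0.25, 0.15] + [0.08, 0.03, 0.005, 0.0, 0.0]
print(recovery_seconds(timeline, spike_end=10))  # → 2
```

Tracking that one number across releases tells you whether your autoscaler and your retry logic are getting better or worse at absorbing bursts.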
Baseline test. Boring but essential. Run it after every major change. Without a baseline, you can’t tell if this week’s deploy made things slower or if you’re just imagining it.
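A baseline only helps if something acts on it automatically. A sketch of the comparison a CI gate would run; the 10% regression budget is an arbitrary choice, pick one that matches your latency goals:

```python
# Sketch: fail the deploy if the new p99 regresses past a budget
# relative to the stored baseline. The 10% budget is arbitrary.

def regressed(baseline_p99_ms, current_p99_ms, budget=0.10):
    """True if the current run is more than `budget` slower than baseline."""
    return current_p99_ms > baseline_p99_ms * (1 + budget)

print(regressed(120.0, 125.0))  # → False (within budget)
print(regressed(120.0, 140.0))  # → True  (flag the deploy)
```

Without this gate, the baseline run is just another graph nobody looks at.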
Stress tests – crank it until it breaks – are the least useful in my experience. You already know it’ll break. The question is whether it breaks gracefully.
Three rules I follow
Use production-shaped data. If your production Postgres has 50M rows, don’t test against 10K. The query planner behaves completely differently. I’ve seen teams celebrate perfect benchmarks that fell apart because their test database fit in RAM.
Include think time. Real users pause between clicks. Without pauses, you’re testing with infinite-speed robots, which inflates concurrency numbers and gives you false confidence.
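To see how much zero think time distorts the numbers, compare the per-user request rate with and without pauses. A back-of-the-envelope sketch; the 200 ms request time and 5-second pause are made-up figures:

```python
# Sketch: requests per minute from one simulated user, with and
# without think time. Both timing values are illustrative.

def requests_per_minute(think_time_s, request_time_s=0.2):
    """How many requests one user issues per minute of the test."""
    return 60 / (request_time_s + think_time_s)

print(round(requests_per_minute(0.0)))  # zero-pause robot: 300 req/min
print(round(requests_per_minute(5.0)))  # human-ish pacing: ~12 req/min
```

Same “100 concurrent users” headline, 25x difference in actual request rate. That’s the false confidence.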
Track percentiles, not averages. A p50 of 80ms and a p99 of 12 seconds means your “average” user is fine but 1 in 100 is having a terrible time. Averages hide the pain.
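The arithmetic is easy to sketch. Here’s a nearest-rank percentile over raw latency samples, with invented numbers matching the example above:

```python
import math

# Sketch: nearest-rank percentiles over raw latency samples.

def percentile(samples, p):
    """Nearest-rank percentile, p in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# 98 fast responses and two 12-second outliers:
latencies_ms = [80] * 98 + [12000] * 2
print(round(sum(latencies_ms) / len(latencies_ms)))  # mean ≈ 318 ms
print(percentile(latencies_ms, 50))                  # → 80
print(percentile(latencies_ms, 99))                  # → 12000
```

The mean lands at a number no real user ever experienced. The percentiles show both populations.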
On tooling
I use k6 these days. It’s written in Go, its scripts are JavaScript, and it produces clean output. But honestly, the tool matters less than the discipline. Store your test scripts in version control. Run them in CI. Compare results across releases. A load test you run once is trivia. A load test you run every deploy is infrastructure.
Load testing isn’t about proving your system is fast. It’s about finding out where it’s weak before your users do.