At Dropbyke we went through the classic monitoring arc. Started with nothing, panicked after an outage, installed StatsD and Grafana, then spent two weeks shipping every metric we could find into dashboards nobody looked at. We had 47 Grafana panels. Forty-seven. And when the next incident hit, it still took us twenty minutes to figure out what was broken.
The problem wasn’t tooling. The problem was noise. So we deleted 42 of those panels. Kept five metrics. Built real alerts around them. Our on-call engineer started sleeping again.
Most monitoring is waste
Here is what I’ve learned running production systems: if you’re watching more than a handful of metrics, you’re watching none of them. Human attention doesn’t scale. Twenty dashboards with eight panels each means you have zero dashboards, because nobody is actually reading them under pressure.
The instinct after an outage is always “we need more visibility.” Wrong. You need better visibility. Those are different things.
Five numbers, nothing else
For a startup backend serving real users, these are the only metrics I care about on a daily basis:
1. Request latency at p95 and p99. Not average. Averages lie. A service averaging 40ms with a p99 at 3 seconds is broken for one in a hundred users, and that one user is filing the support ticket. We tracked this per service in Prometheus and set alerts at the Grafana layer.
2. Error rate by type. Not just “5xx count.” Break it down. A spike in 401s is a different problem than a spike in 503s. One is probably a bad deploy or a client bug. The other is your database falling over. The distinction matters because the response is completely different.
3. Saturation of your bottleneck resource. For us that was PostgreSQL connection pool utilization. For you it might be memory, disk I/O, or worker threads. The point is: know which resource will kill you first and watch that one. Not all resources. That one.
4. Request throughput. Traffic going up is good. Traffic dropping to zero at 2pm on a Tuesday is very bad. This metric is less about the number and more about detecting sudden changes. A 50% drop in requests is an incident whether or not errors are firing.
5. Deployment markers. Not a metric in the traditional sense, but I overlay every deploy on our Grafana dashboards. Half the incidents I’ve investigated started with “someone shipped something.” Correlating metrics with deploys cuts your mean time to diagnosis in half.
That’s it. Five things.
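For concreteness, here is roughly what the first four look like as Prometheus queries. The metric names (`http_request_duration_seconds_bucket`, `http_requests_total`, the `db_pool_*` gauges) are placeholders for whatever your instrumentation actually exports, not what we ran:

```promql
# 1. Tail latency: p99 over a 5-minute window, broken out per service
histogram_quantile(0.99,
  sum by (le, service) (rate(http_request_duration_seconds_bucket[5m])))

# 2. Error rate by status code, as a fraction of total traffic,
#    so a 401 spike and a 503 spike show up as separate series
sum by (status) (rate(http_requests_total{status=~"4..|5.."}[5m]))
  / scalar(sum(rate(http_requests_total[5m])))

# 3. Saturation of the bottleneck resource (for us, Postgres connections);
#    substitute whatever gauge your pool exporter exposes
db_pool_active_connections / db_pool_max_connections

# 4. Throughput relative to the same window yesterday, to catch
#    sudden drops even when nothing is erroring
sum(rate(http_requests_total[5m]))
  / sum(rate(http_requests_total[5m] offset 1d))
```

The fifth has no query: deploy markers are annotations, pushed to Grafana from the deploy pipeline so they overlay every panel.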
Alerts should wake you up for a reason
We had a rule: if an alert fires and the on-call engineer can’t take a meaningful action within five minutes, delete the alert. Brief CPU spikes? Not actionable. Disk at 80%? Only actionable if there’s a runbook. “Something might be wrong” isn’t an alert. It’s anxiety.
Our paging alerts were:
- p99 latency above threshold for 5 minutes
- Error rate above 1% for 3 minutes
- Primary database connection pool above 90%
Three alerts. That was it for paging. Everything else went to a Slack channel that people checked during business hours.
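Wired up as Prometheus alerting rules, those three pagers look something like this. Thresholds and metric names are illustrative; the `for:` durations mirror the windows above:

```yaml
groups:
  - name: paging
    rules:
      - alert: HighP99Latency
        # p99 above 500ms sustained for 5 minutes
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
        for: 5m
        labels:
          severity: page

      - alert: HighErrorRate
        # more than 1% of requests failing for 3 minutes
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 3m
        labels:
          severity: page

      - alert: DbPoolSaturated
        # primary database connection pool above 90%;
        # hypothetical gauges, substitute your pool exporter's names
        expr: db_pool_active_connections / db_pool_max_connections > 0.9
        labels:
          severity: page
```

In Alertmanager you would then route `severity: page` to the pager and everything else to that business-hours Slack channel.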
The discipline is in what you remove
After I stripped our monitoring down, the team pushed back. “What if we miss something?” My answer: we were already missing things. Forty-seven panels meant nobody looked at any of them carefully. Five panels meant the on-call engineer actually understood the state of the system at a glance.
Monitoring isn’t a collection problem. It’s an attention problem. Treat it that way.
Five metrics, three alerts, zero noise
Pick the five metrics that reflect what your users experience. Build alerts only for conditions where someone can take immediate action. Delete everything else. Discipline over dashboards.