Monitoring Is Not Enough


Your dashboards look green. Your users say the site is broken. That gap is the whole problem.

Most teams think they have monitoring figured out. Dashboards, thresholds, PagerDuty. Something spikes, someone gets paged, someone fixes it. Works great when you have a monolith and three endpoints.

It falls apart the second you split into services.

I learned this the hard way at a fintech startup. We had decent Grafana dashboards. Reasonable alerts. Then we started breaking things into microservices and deploying multiple times a day. A request would fail somewhere in a chain of five services and our dashboards would just… look fine. Every individual service reported healthy metrics. The problem lived in the gaps between them.

That’s the core issue with monitoring. It answers questions you already thought to ask. Latency on this endpoint? Sure. Error rate on that queue? Got it. But the failure you actually hit in production is the one you never predicted. Your dashboards have no panel for it.

Observability is a different mindset

Observability means you instrument your system so you can ask new questions after something breaks. Not just “is it up?” but “why did this specific user’s request take 8 seconds at 3am on Tuesday?”

Three signals make this work: metrics for trends and alerts, logs for detail, and traces for stitching a single request across every service it touches. Separately they are useful. Together they are a debugging superpower.

Structured logging is the foundation. Stop writing free-text log lines. Make every log entry a JSON object with a trace ID, service name, version, and whatever fields you actually need to filter on.

{
  "timestamp": "2017-03-20T10:23:45Z",
  "level": "error",
  "message": "Payment processing failed",
  "trace_id": "abc123",
  "service": "payment-service",
  "version": "2.1.3"
}
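A minimal sketch of what emitting lines like that can look like in Python. The `log_event` helper, the service name, and the extra fields are all hypothetical, not a real library API; the point is that every entry shares the same base fields so you can filter on them later.

```python
import json
from datetime import datetime, timezone

# Hypothetical service identity; in practice these come from your build/deploy pipeline.
SERVICE = "payment-service"
VERSION = "2.1.3"

def log_event(level, message, trace_id, **fields):
    """Build and print one JSON log line with the shared base fields."""
    entry = {
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "level": level,
        "message": message,
        "trace_id": trace_id,
        "service": SERVICE,
        "version": VERSION,
        **fields,  # whatever you actually need to filter on
    }
    line = json.dumps(entry)
    print(line)
    return line

log_event("error", "Payment processing failed", "abc123", amount_cents=4999)
```

Because every line is valid JSON with consistent keys, “show me all errors for trace abc123 in payment-service” becomes a query instead of a grep.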

Then propagate context. Generate a trace ID at the edge and carry it through every downstream call. This is the single most important thing you can do. Without it you’re just grepping logs and praying.

X-Trace-Id: abc123
X-Span-Id: def456
X-Parent-Span-Id: ghi789
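One way to sketch that propagation in Python, using `contextvars` so the current trace follows the request through async or threaded handlers. The function names (`start_trace`, `outgoing_headers`) are illustrative, not from any particular tracing library; real systems typically use an instrumentation SDK for this.

```python
import uuid
from contextvars import ContextVar

# Header names matching the example above.
TRACE_HEADER = "X-Trace-Id"
SPAN_HEADER = "X-Span-Id"
PARENT_HEADER = "X-Parent-Span-Id"

_trace_id: ContextVar[str] = ContextVar("trace_id", default="")
_span_id: ContextVar[str] = ContextVar("span_id", default="")

def start_trace(incoming_headers):
    """At the service edge: reuse the caller's trace ID, or mint a new one."""
    trace = incoming_headers.get(TRACE_HEADER) or uuid.uuid4().hex
    _trace_id.set(trace)
    _span_id.set(uuid.uuid4().hex)  # this service's own span
    return trace

def outgoing_headers():
    """Headers to attach to every downstream call made in this context."""
    return {
        TRACE_HEADER: _trace_id.get(),
        SPAN_HEADER: uuid.uuid4().hex,   # fresh span for the child call
        PARENT_HEADER: _span_id.get(),   # our span becomes the child's parent
    }

# A request arrives at the edge with no trace header: mint one, pass it on.
start_trace({})
downstream = outgoing_headers()
```

The key property: whether the trace ID was minted here or inherited from upstream, every downstream call carries the same `X-Trace-Id`, so one query later reassembles the whole request.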

How incidents actually change

Alert fires. Pull the trace. Follow the slow span to the source. Read the logs for that trace. Confirm with metrics whether it’s one user or everyone. Done. Five minutes instead of forty-five minutes of bouncing between dashboards while your Slack channel fills with “any update?”

Same thing with performance. User says it’s slow. Pull their trace. Compare to a healthy one. See the difference – a cache miss, an extra database round-trip, a third-party call timing out. You fix the actual cause instead of guessing.
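The “compare to a healthy one” step can be as simple as diffing span durations. A toy sketch, assuming each trace is just a list of `(span_name, duration_ms)` pairs, which is a deliberate simplification of what a real tracing backend stores:

```python
def extra_time_per_span(slow_trace, healthy_trace):
    """Rank spans by how much longer they ran in the slow trace vs the healthy one."""
    healthy = dict(healthy_trace)
    deltas = {name: ms - healthy.get(name, 0) for name, ms in slow_trace}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)

healthy = [("auth", 12), ("db_query", 30), ("render", 8)]
slow = [("auth", 13), ("db_query", 2900), ("render", 9)]

# The top entry points straight at the span that ate the time.
culprit, delta = extra_time_per_span(slow, healthy)[0]
```

Here `db_query` surfaces immediately with ~2.9 s of extra time, so you investigate that query instead of guessing.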

The tools don’t matter that much

Prometheus, InfluxDB, ELK, Jaeger, Zipkin – pick whatever fits your stack. Commercial platforms that bundle all three signals save time. But the tooling isn’t the hard part. The hard part is disciplined instrumentation. Consistent field names. Trace IDs everywhere. Every team following the same conventions.

What actually matters

Observability isn’t a product you buy. It’s a practice you build. You stop staring at dashboards waiting for red. You start asking questions about behavior you didn’t expect. That shift – from reactive to exploratory – is the entire point. And in a world where every team is shipping services independently, it’s the only way to stay sane.