Why Monitoring Wasn't Enough and How We Built Observability at a Fintech Startup

5 min read
Tags: observability, monitoring, devops, distributed-systems

After a mystery outage that our dashboards couldn't explain, I rebuilt the fintech startup's telemetry stack around metrics, logs, and traces. Here's what I learned.

It was 2 AM on a Wednesday, and our news ingestion pipeline at the fintech startup had gone silent. No errors in the logs. No alerts firing. CPU fine, memory fine, disk fine. Every dashboard said the system was healthy. But users were seeing stale financial news, some of it hours old.

I spent forty minutes SSH-ing into boxes, tailing logs, and grepping for exceptions. Nothing. The pipeline just… stopped processing. It took me another hour to discover the root cause: a third-party API we depended on had started returning empty 200 responses instead of actual data. Our monitoring checked for errors. It checked for timeouts. It never checked for “success that contains nothing useful.”
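In hindsight, the fix was a validation step that treats an empty 200 as a failure in its own right. Here's a sketch of that idea — `validate_stories` and `ProviderError` are hypothetical names for illustration, not our actual code:

```python
class ProviderError(Exception):
    """Raised when a provider response is unusable, even if the HTTP status looked fine."""


def validate_stories(status_code, payload):
    """Reject 'successful' responses that carry no usable data."""
    if status_code != 200:
        raise ProviderError(f"provider returned HTTP {status_code}")
    stories = payload.get("stories") if isinstance(payload, dict) else None
    if not stories:
        # The 2 AM failure mode: 200 OK, nothing inside.
        raise ProviderError("provider returned 200 with no stories")
    return stories
```

The point isn't the helper itself; it's that "success" gets defined by the shape of the data, not the status code.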

That night changed how I think about production systems.

Monitoring Checks Boxes. Observability Answers Questions.

Monitoring is built around things you already know can go wrong. Threshold crossed, alert fires, runbook engaged. It works great for predictable problems: disk filling up, CPU pegged, error rate spiking. We had all of that at the fintech startup. Grafana dashboards everywhere. PagerDuty wired up. It felt safe.

But distributed systems don’t fail in predictable ways. They fail in weird, combinatorial, never-seen-this-before ways. And that’s where monitoring falls apart. You can’t write an alert for a failure mode you haven’t imagined yet.

Observability flips the model. Instead of predefining what questions the system can answer, you instrument it richly enough that you can ask new questions on the fly. During an incident. At 2 AM. Without deploying anything.

The difference: monitoring tells you something is wrong. Observability helps you figure out why.

Three Signals, One Story

After the empty-200 incident, I started rebuilding our telemetry around three pillars.

Metrics give you the bird’s-eye view. Aggregated numbers over time. At the fintech startup, we track request rates, error rates, and latency distributions for every service using Prometheus.

http_requests_total{method="GET", endpoint="/api/stories", status="200"} 15234
http_request_duration_seconds_bucket{endpoint="/api/stories", le="0.5"} 421

Metrics are cheap to store and great for alerting. But they can’t tell you why a specific request failed or what user it affected. They’re the smoke detector, not the fire investigator.
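Conceptually, each request increments a labeled counter and lands in cumulative histogram buckets — that's all those samples represent. Here's a stdlib-only sketch of that bookkeeping; a real service would use a Prometheus client library, and the names here are illustrative:

```python
from collections import defaultdict

# Toy in-process registry mirroring the samples above.
requests_total = defaultdict(int)    # (method, endpoint, status) -> count
duration_buckets = defaultdict(int)  # (endpoint, le) -> count
LE_BOUNDS = (0.1, 0.5, 1.0, float("inf"))  # histogram bucket upper bounds


def observe_request(method, endpoint, status, duration_s):
    requests_total[(method, endpoint, status)] += 1
    # Prometheus histograms are cumulative: a request falls into
    # every bucket whose upper bound it fits under.
    for le in LE_BOUNDS:
        if duration_s <= le:
            duration_buckets[(endpoint, le)] += 1


observe_request("GET", "/api/stories", "200", 0.34)
observe_request("GET", "/api/stories", "200", 0.72)
```

The cumulative-bucket detail matters: the first request above counts in the 0.5s, 1.0s, and +Inf buckets, the second only in 1.0s and +Inf, which is how Prometheus can compute latency quantiles later.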

Logs capture the details. Each event, each error, each decision the code made. We moved early to structured JSON logs because free-form strings are nearly useless at scale.

{"timestamp":"2018-07-09T10:30:45Z","level":"error","service":"ingestion","request_id":"abc123","message":"Empty response from provider","provider":"reuters","duration_ms":340}

That structured format meant we could query logs in Kibana instead of grepping through them. Game changer for incident response.
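For anyone wiring this up in Python, a minimal formatter along these lines gets you one JSON object per line. The field names mirror the example above; the `JsonFormatter` class itself is a sketch, not our production code:

```python
import json
import logging
import sys
import time


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so Elasticsearch/Kibana can index fields directly."""

    def format(self, record):
        entry = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname.lower(),
            "service": "ingestion",
            "message": record.getMessage(),
        }
        # Extra fields (request_id, provider, ...) ride along as top-level JSON keys.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


log = logging.getLogger("ingestion")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log.addHandler(handler)

log.error("Empty response from provider",
          extra={"fields": {"request_id": "abc123", "provider": "reuters", "duration_ms": 340}})
```

The `extra` dict is the stdlib's hook for attaching structured fields to a record — no third-party logging library required to get queryable output.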

Traces show you the journey of a single request across services. This was the missing piece for us. The fintech startup’s architecture involves an ingestion service talking to NLP processors talking to a ranking engine talking to the API layer. When something is slow, you need to see the whole chain.

Trace abc123
- POST /ingest/batch (120ms)
  - NLP enrichment (45ms)
  - Relevance scoring (30ms)
  - db.write stories (35ms)

We started with Zipkin. Not perfect, but suddenly I could see that our NLP service was adding 200ms of latency on certain content types. That was invisible before.
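Stripped of the Zipkin specifics, a trace is just nested, timed spans sharing a trace ID. Here's a toy sketch of that shape — `span` and the `spans` list are hypothetical names, not a real client API:

```python
import time
from contextlib import contextmanager

# Toy tracer: a real setup uses a Zipkin/OpenTracing client, but the
# core idea is nested, timed spans tied together by one trace ID.
spans = []


@contextmanager
def span(trace_id, name, parent=None):
    start = time.monotonic()
    record = {"trace_id": trace_id, "name": name, "parent": parent}
    try:
        yield record
    finally:
        record["duration_ms"] = (time.monotonic() - start) * 1000
        spans.append(record)  # inner spans finish (and append) before the root


with span("abc123", "POST /ingest/batch") as root:
    with span("abc123", "NLP enrichment", parent=root["name"]):
        pass
    with span("abc123", "db.write stories", parent=root["name"]):
        pass
```

Render those records as a tree and you get exactly the waterfall shown above — which is what made the NLP latency visible.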

The Glue: Correlation IDs

Each signal alone is useful. Together, connected by a shared identifier, they’re powerful.

Here’s the workflow that actually matters: a latency alert fires from metrics. You pull up the trace for a slow request. The trace shows a specific span taking too long. You jump to the logs for that span using the same request ID and find the exact query and error.

X-Request-ID: 5f3c9e86c2c84e1b
X-B3-TraceId: 4d1e00a3b9bd1d42
X-B3-SpanId: 6df3a1c2b93f6b1a

We made it a rule: every log line includes the request ID, and every trace propagates context headers. No exceptions. It took weeks to retrofit across all our services, but the payoff during the next incident was immediate. Instead of an hour of archaeology, I had a clear thread to pull.
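The rule itself is small enough to sketch. Assuming dict-like headers, something like this reuses the caller's ID when present and mints one otherwise — `ensure_request_id` and `outbound_headers` are illustrative names, though the header names are the ones we actually propagated:

```python
import uuid

REQUEST_ID_HEADER = "X-Request-ID"


def ensure_request_id(headers):
    """Reuse the caller's request ID when present, so the thread stays unbroken."""
    rid = headers.get(REQUEST_ID_HEADER)
    return rid if rid else uuid.uuid4().hex[:16]


def outbound_headers(request_id, trace_id, span_id):
    """Headers attached to every downstream call (B3 propagation, as used by Zipkin)."""
    return {
        REQUEST_ID_HEADER: request_id,
        "X-B3-TraceId": trace_id,
        "X-B3-SpanId": span_id,
    }
```

Put the `ensure_request_id` call in shared middleware rather than in each handler — that's what makes "no exceptions" enforceable.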

Instrument the Critical Path First

When I started adding instrumentation at the fintech startup, the temptation was to instrument everything. Don’t do that. You’ll drown in data and your storage costs will spike.

Start with what matters: the critical path your users depend on.

For us that meant inbound API handlers, the news ingestion pipeline, database calls, and every external API dependency. We used RED (rate, errors, duration) for our services and USE (utilization, saturation, errors) for infrastructure resources.

Tracing has a sampling problem. Capturing every single request is expensive. We settled on keeping 100% of errors and sampling successful requests at about 5%. During incidents, we crank sampling up. Good enough for debugging, affordable enough to run continuously.
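That policy comes down to a few lines of head-based sampling. A sketch — `should_sample` is a hypothetical helper, and the `rng` parameter exists only to make the decision deterministic in tests:

```python
import random

DEFAULT_SUCCESS_RATE = 0.05  # keep ~5% of successful requests


def should_sample(is_error, success_rate=DEFAULT_SUCCESS_RATE, rng=random.random):
    """Keep every error trace; sample successes at the configured rate."""
    if is_error:
        return True  # errors are always kept, 100%
    return rng() < success_rate
```

Cranking sampling up during an incident is then just raising `success_rate` via config, no redeploy needed.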

Designing Systems That Can Be Debugged

Observability isn’t something you bolt on after the fact. It’s a design choice.

Structured logs over free-form strings. Always. If you can’t query it, it’s useless when you’re under pressure at 2 AM.

Watch your cardinality. Metrics with unbounded label values (like user IDs) will destroy your Prometheus instance. High-cardinality data belongs in logs and traces, not metrics.
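One cheap defense is to collapse unbounded values into a small fixed set before they ever become label values. For example — `status_class` is an illustrative helper:

```python
def status_class(code):
    """Collapse raw HTTP status codes into a bounded label value (2xx/3xx/4xx/5xx)."""
    return f"{code // 100}xx" if 200 <= code < 600 else "other"
```

Label with `status_class(code)` instead of the raw code, and keep the raw code (and the user ID) in the log line, where cardinality is cheap.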

Add context that matters. A trace that only shows timings is half the story. Include the request type, the tenant, the result count. When something goes wrong, that context is the difference between a five-minute fix and a two-hour hunt.

Our Stack in 2018

For anyone building this out now, here’s what we were running:

  • Metrics: Prometheus with Alertmanager, Grafana for dashboards
  • Logs: Fluentd piping into Elasticsearch, Kibana for exploration
  • Tracing: Zipkin with OpenTracing instrumentation

OpenTracing is getting solid adoption and keeps us from being locked into one vendor. OpenCensus is emerging as an alternative worth watching. Both are pushing toward a world where trace context propagation just works out of the box.

There are commercial options that bundle everything behind one query layer. We looked at them. For our scale, the open source stack made more sense. That calculus changes depending on team size and how much operational overhead you can absorb.

What That 2 AM Incident Taught Me

That silent pipeline failure was a gift. It exposed a blind spot in how I thought about production readiness. Having dashboards isn’t the same as having understanding. Monitoring answers the questions you thought to ask. Observability gives you the ability to investigate the questions you didn’t.

We still have incidents at the fintech startup. But now when something breaks in a way we’ve never seen before, we have the telemetry to figure it out fast. That’s the whole point.