Observability-Driven Development Is Just Instrumenting Your Code


ODD sounds fancy. It's not. It means writing logs, metrics, and traces before you ship, not after your first outage.

“Observability-Driven Development” has entered the conference talk circuit and I already hate the name. It sounds like a methodology. Like you need a certification or a Slack channel or a retro format.

It’s not a methodology. It’s just: instrument your code before you ship it. That’s it. That’s the whole thing.

And yet I keep walking into systems – at startups, at enterprises, everywhere – where observability was bolted on after the first production fire. The logs are a mess. Every service logs differently. Metrics exist for some endpoints but not others. Nobody has traces. The dashboards are either empty or full of vanity charts that nobody looks at.

The actual problem

When observability comes last, every team invents their own approach. Service A logs JSON with request_id. Service B logs plain text with req_id. Service C doesn’t log request IDs at all. You find this out at 2am during an outage while trying to correlate a failure across three services.

I’ve lived this. At the fintech startup we had a period where our financial data pipeline logs were useless for debugging cross-service issues because every team had picked their own field names. Fixing that retroactively took weeks. If we’d agreed on a format before writing the services, it would’ve taken an afternoon.

What “ODD” actually means in practice

Before you write a feature, answer three questions:

  1. How will I know this is working correctly in production?
  2. How will I know this is slow?
  3. How will I know this is broken?

If you can’t answer those, you’re not ready to write the code. That’s the whole framework. No acronym needed.

Structured logs or nothing

Your logs should be structured JSON with consistent field names across every service. Non-negotiable.

{"level":"info","event":"order_created","request_id":"abc-123","order_id":"ord-456","duration_ms":42}

Every log line gets a request_id. Every log line gets an event name. Every log line gets a level. If you’re logging fmt.Println("something happened") in production Go code, we need to talk.

Pick your field names once. Write them down. Enforce them in code review. This is boring work. It pays off every single incident.

Metrics: RED and stop

For services, use RED metrics. Rate, Errors, Duration. For every endpoint. That’s your baseline.

I see teams go wild with custom metrics on day one. Thirty metrics per service, half of them never queried. Meanwhile they’re missing basic error rate tracking on their most critical endpoint. Start with RED. Add custom metrics when you have a specific question that RED can’t answer.

One thing that will absolutely burn you: high-cardinality labels. If you’re putting user IDs or full URLs into metric labels, you’re building a cost bomb. I saw one team’s Prometheus storage costs triple in a month because someone added a path label that included query parameters. Keep labels to things like method, status_code, service. Low and predictable.

Traces aren’t optional

Distributed tracing used to feel like a luxury. It’s not. If you’re running more than two services, you need traces. Full stop.

Every inbound request starts a trace. Every outbound call propagates the trace context. This is a few lines of middleware in Go. It’s trivial. And it’s the difference between “I think the problem is in the payment service” and “I can see the exact call that took 4 seconds.”

Sample if you need to for cost reasons. But sample consistently – don’t sample 100% on staging and 1% on production and then wonder why you can never find the trace you need.

Make it part of code review

This is where it sticks or falls apart. If observability isn’t in your code review checklist, it won’t happen.

When I review a PR that adds a new endpoint, I look for:

  • Does the handler emit RED metrics?
  • Are key events logged with stable fields?
  • Does the trace propagate to downstream calls?
  • Are the metric labels low-cardinality – no user IDs, no raw URLs?

If the answer is no, the PR isn’t ready. Same as missing tests. Same as missing error handling. Observability isn’t a follow-up ticket. It ships with the feature.

Alerts that don’t suck

Most alerts are terrible. They fire on every blip, train everyone to ignore them, and then nobody notices when something actually breaks.

Alert on symptoms, not causes. Alert on “error rate is above X% for Y minutes,” not “one request returned a 500.” Better yet, use SLO-based alerts. Set an error budget. Alert when you’re burning through it too fast. This single change cut our alert noise at Decloud by something like 80%.
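In Prometheus terms, a fast-burn alert looks something like this. This assumes the RED metrics from earlier (`http_requests_total` with a `status_code` label) and a 99.9% availability SLO over 30 days; the 14.4x multiplier is the standard fast-burn threshold for a one-hour window – adjust everything to your own SLO:

```yaml
groups:
  - name: slo-burn
    rules:
      - alert: ErrorBudgetBurnFast
        expr: |
          (
            sum(rate(http_requests_total{status_code=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Burning the 30-day error budget ~14x too fast"
```

One rule like this, plus a slower-burn variant over a longer window for ticket-severity alerts, replaces a pile of per-endpoint threshold alerts.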

Stop making this complicated

The observability vendor ecosystem wants you to believe this is complex. It’s not. Structured logs, RED metrics, distributed traces, and alerts that fire on actual problems. Agree on conventions. Enforce them in review. Ship them with every feature.

That’s observability-driven development. No manifesto required.