// Topic
Monitoring
Definition
Monitoring coverage in this archive spans 9 posts from Dec 2016 to Apr 2025 and treats reliability, delivery speed, and cost discipline as one system rather than three separate concerns. The strongest adjacent threads are observability, devops, and production. Recurring title keywords include "AI", "observability", "monitoring", and "enough".
What the archive argues
- Most posts prioritize predictable operations over feature breadth or stack novelty.
- Early posts center on the keywords "monitoring" and "enough", while newer posts turn to AI and eBPF as constraints shifted.
- This topic repeatedly intersects with observability, devops, and production, so design choices here rarely stand alone.
Execution checklist
- Set SLOs first, then choose tooling that keeps deployment, observability, and rollback simple.
- Start with the newest post to calibrate current constraints, then backtrack to older entries for first principles.
- When boundary questions appear, cross-read the observability and devops topics before committing to implementation details.
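As a rough illustration of the "set SLOs first" item, here is a minimal Python sketch of an error-budget check that could gate a deploy. Nothing in it comes from the posts themselves: the `SLO` class, the `checkout-availability` name, the 99.9% target, and the 25% `min_remaining` threshold are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """A hypothetical availability SLO measured over a fixed request window."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests must succeed


def error_budget(slo: SLO, total_requests: int) -> int:
    """Number of failed requests the SLO tolerates in the window."""
    return int(round((1.0 - slo.target) * total_requests))


def budget_remaining(slo: SLO, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(slo, total_requests)
    if budget == 0:
        # No failures are tolerated: the budget is either intact or gone.
        return 1.0 if failed_requests == 0 else 0.0
    return 1.0 - failed_requests / budget


def can_deploy(slo: SLO, total_requests: int, failed_requests: int,
               min_remaining: float = 0.25) -> bool:
    """Allow a deploy only while enough error budget is left (assumed policy)."""
    return budget_remaining(slo, total_requests, failed_requests) >= min_remaining


if __name__ == "__main__":
    slo = SLO(name="checkout-availability", target=0.999)  # hypothetical SLO
    print(error_budget(slo, 1_000_000))
    print(can_deploy(slo, 1_000_000, failed_requests=250))
    print(can_deploy(slo, 1_000_000, failed_requests=900))
```

The point of the sketch is the ordering: the SLO and its budget are defined before any tooling is chosen, and whatever deploy or rollback tooling follows only has to answer one question against that number.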
Common failure modes
- Adding platform layers faster than the team can operate and debug them.
- Chasing throughput gains without proving they improve end-user reliability.
- Applying guidance from anywhere in the 2016 to 2025 span without revisiting its assumptions as context changed.
Suggested reading path
- Start here (current state): Testing AI Where It Actually Runs
- Then read (operating middle): eBPF Is Interesting. I Am Not Sold Yet.
- Finish with (foundational context): Why We Deleted 42 Grafana Panels
Related posts
- Testing AI Where It Actually Runs
- Your AI System Looks Healthy. It Is Not.
- OpenTelemetry in Late 2021: What’s Ready and What’s Not
- Observability-Driven Development Is Just Instrumenting Your Code
- eBPF Is Interesting. I Am Not Sold Yet.
- Observability for Small Distributed Teams (What Actually Works)
- Why Monitoring Wasn’t Enough and How We Built Observability at a Fintech Startup
- Monitoring Is Not Enough
References
9 posts
Testing AI Where It Actually Runs
Offline evals are necessary but not sufficient. Here's how I test AI features in production with shadow mode, canaries, and rollback automation -- with Go code.
Your AI System Looks Healthy. It Is Not.
Traditional monitoring will tell you your AI service is up. It won't tell you it's returning confident garbage. Here's what observability actually looks like for AI.
OpenTelemetry in Late 2021: What's Ready and What's Not
Tracing is ready. Metrics are getting there. Logs are not. Here's a practical adoption path and the code to back it up.
Observability-Driven Development Is Just Instrumenting Your Code
ODD sounds fancy. It's not. It means writing logs, metrics, and traces before you ship, not after your first outage.
eBPF Is Interesting. I Am Not Sold Yet.
eBPF promises kernel-level observability without the pain of kernel modules. The tech is real. The hype-to-adoption ratio concerns me.
Observability for Small Distributed Teams (What Actually Works)
Most observability advice is written for 500-engineer orgs. Here's what actually matters when you're a small distributed team trying not to drown in dashboards.
Why Monitoring Wasn't Enough and How We Built Observability at a Fintech Startup
After a mystery outage that our dashboards couldn't explain, I rebuilt the fintech startup's telemetry stack around metrics, logs, and traces. Here's what I learned.
Monitoring Is Not Enough
Your dashboards look green. Your users say the site is broken. That gap is the whole problem.
Why We Deleted 42 Grafana Panels
Most teams monitor too much and alert on the wrong things. Five metrics are enough to run a startup backend.