Reliability

Definition

Reliability coverage in this archive spans 18 posts from Jul 2016 to Jan 2026 and treats reliability, delivery speed, and cost discipline as one system rather than three separate concerns. The most closely related topics are architecture, sre, and ai. Recurring title motifs include production, ai, outage, and taught.

Key claims

Most posts prioritize predictable operations over feature breadth or stack novelty.
Early post titles lean on "systems" and "production", while newer ones lean on "engineering" and "outage", reflecting how constraints shifted over time.
This topic repeatedly intersects with architecture, sre, and ai, so design choices here rarely stand alone.

Practical checklist

Set SLOs first, then choose tooling that keeps deploy, observability, and rollback simple.
Start with the newest post to calibrate current constraints, then backtrack to older entries for first principles.
When boundary questions appear, cross-read the architecture and sre topics before committing to implementation details.
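
As a rough illustration of the "set SLOs first" item, here is a minimal Python sketch of an availability SLO and its error budget. Everything in it is an assumption made for illustration and is not taken from the posts listed below: the names SLO, error_budget, budget_remaining, and can_ship, the request-based window, and the 25% freeze threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """Hypothetical availability SLO measured over a window of requests."""

    target: float  # e.g. 0.999 means 99.9% of requests must succeed

    def __post_init__(self) -> None:
        # A target of exactly 1.0 leaves no error budget to reason about.
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")

    def error_budget(self, total_requests: int) -> float:
        """Number of failed requests the window can absorb."""
        return (1.0 - self.target) * total_requests

    def budget_remaining(self, total_requests: int, failed_requests: int) -> float:
        """Fraction of the budget still unspent; negative once overspent."""
        return 1.0 - failed_requests / self.error_budget(total_requests)


def can_ship(slo: SLO, total_requests: int, failed_requests: int,
             freeze_below: float = 0.25) -> bool:
    """Assumed policy: pause risky deploys once less than 25% of budget is left."""
    return slo.budget_remaining(total_requests, failed_requests) > freeze_below


if __name__ == "__main__":
    slo = SLO(target=0.999)
    remaining = slo.budget_remaining(1_000_000, 400)
    print(f"budget remaining: {remaining:.0%}, ship: {can_ship(slo, 1_000_000, 400)}")
```

The point of the sketch is the ordering: the target and the ship/freeze rule are written down before any deploy or observability tooling is chosen, so the tooling only has to supply the request and failure counts.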

Failure modes

Adding platform layers faster than the team can operate and debug them.
Chasing throughput gains without proving they improve end-user reliability.
Applying 2016-era guidance to 2026 systems without revisiting assumptions that have changed along with the context.

Suggested reading path

Start here (current state): Building Reliable AI Agents in Go
Then read (operating middle): Your Load Tests Are Lying to You
Finish with (foundational context): Building Resilient Systems: Lessons from Production Failures

References

25 entries tagged “Reliability”

The Benchmark You Didn't Build July 16, 2026 · 4 min Public benchmarks are contaminated and gamed. The only eval that matters runs on your traffic, your failure modes, your bar—and you own it. ai reliability metrics

Agentic Systems at Scale: The New Reliability Contract July 7, 2026 · 4 min Agentic systems need SRE-style reliability contracts with explicit blast-radius limits, fallback paths, and kill switches. ai reliability operations

The Anti-Fragile AI Organization July 2, 2026 · 4 min The best AI organizations do not merely survive model churn and vendor shocks; they convert each one into a capability they keep. teams ai reliability

Designing the AI Leadership Bench: Roles, Interfaces, and Failure Boundaries June 10, 2026 · 2 min AI scaling needs explicit leadership interfaces between product, platform, reliability, and governance. leadership teams ai

How to Run an AI Incident Review That Changes Architecture, Not Slides June 2, 2026 · 2 min Incident reviews should produce architecture deltas and control updates, not narrative theater. reliability ai governance

AI Production Governance: A Maturity Model April 23, 2026 · 4 min The gap between stable AI features and shipping chaos isn't tools—it's production governance. How mature teams evaluate, deploy, and roll back. governance ai reliability

Why Most Enterprise AI Architecture Fails in Year One April 21, 2026 · 3 min In 2026, enterprise AI isn't failing because models are bad. It is failing because organizations are building brittle demos instead of bounded, operable systems. architecture ai reliability

Building Reliable AI Agents in Go January 19, 2026 · 6 min Reliable agents are engineered, not prompted: bounded tools, validation at every step, explicit recovery paths. Here's how I build them in Go. agents reliability ai

AI Incidents Don't Look Like Outages. That's the Problem. November 10, 2025 · 4 min AI systems can return 200 OK while confidently wrong. How to detect, contain, and learn from AI incidents using proven incident response principles. incident-management ai reliability

Agentic Workflows: From Demo Magic to Production Reality April 1, 2024 · 6 min AI agents that can take actions are fundamentally different from chatbots. The engineering bar must match the blast radius. agents ai production

Why I Run Multiple Models in Production March 18, 2024 · 4 min Betting on a single model provider is like having a single database with no failover. Here is why multi-model is the only sane production strategy. ai architecture llm

The AWS us-east-1 Outage Was Predictable. Your Architecture Was Not Ready. December 20, 2021 · 4 min December 7 reminded everyone that us-east-1 is a single point of failure for half the internet. Again. I am annoyed. aws outage reliability

What a 3 AM Outage Taught Me About Incident Management November 29, 2021 · 6 min Good incident response is not about preventing failure. It is about failing well. Lessons from a decade of on-call, including national cyber-defense and telecom-scale operations. incident-management sre on-call

Database Reliability Engineering: What I've Learned the Hard Way August 9, 2021 · 7 min Practical database reliability from running Postgres in production: configs, safe migration patterns, and the operational habits that prevent outages. databases reliability sre

Most Chaos Engineering Is Theater June 8, 2020 · 3 min Teams love saying they do chaos engineering. Few actually have hypotheses. Even fewer fix what they find. chaos-engineering reliability sre

Zero Downtime Deploys Are a Team Habit, Not a Tool October 21, 2019 · 5 min Every team says they want zero downtime. Few want to do the boring work that actually gets them there. Here's what that boring work looks like. deployment devops kubernetes

Your Load Tests Are Lying to You August 26, 2019 · 3 min Most load tests produce comforting numbers instead of useful answers. Here's what I learned the hard way about getting honest results. testing performance reliability

Your SLOs Are Probably Useless (Here's How to Fix Them) May 20, 2019 · 6 min Most SLOs are dashboards nobody acts on. Pick indicators that reflect real users, set targets from data, and make error budgets change how your team ships. sre slo reliability

Design for Failure or It Will Design Your Weekend May 6, 2019 · 3 min Failure is not an edge case but the default state you hold off with good engineering. Hard-won rules for systems that bend instead of shatter. reliability architecture distributed-systems

Async Job Processing: Patterns That Saved Us at a Fintech Startup December 17, 2018 · 7 min Hard-won patterns for reliable background job processing -- queues, retries, idempotency, and the failures that taught me to care about all three. backend architecture async

What Building Distributed Systems at a Fintech Startup Taught Me About Failure September 17, 2018 · 6 min Hard-won lessons from designing distributed systems that survive real failures -- timeouts, retries, bulkheads, and the habits that keep things running. distributed-systems reliability architecture

SRE Principles Are Great. The Cargo-Culting Is Not. April 30, 2018 · 5 min The SRE hype train has everyone copying Google's playbook without asking whether it fits. What actually matters when you're not running at planet scale. sre devops reliability

You Don't Need to Be Netflix to Break Things on Purpose August 21, 2017 · 4 min Chaos engineering isn't just for the big players. Here's how a small team can start breaking things deliberately and actually learn from it. chaos-engineering reliability testing

How I Build Data Pipelines That Actually Survive Production April 24, 2017 · 6 min Every pipeline I've built at the fintech startup broke at some point. Here's the design approach that made them recoverable instead of catastrophic. data-engineering etl pipelines

Building Resilient Systems: Lessons from Production Failures July 18, 2016 · 7 min Production incidents show where architecture bends and breaks. Lessons on designing for failure, limiting blast radius, and making recovery routine. reliability resilience architecture