// Topic
Production
Definition
Production coverage in this archive spans 27 posts from Feb 2016 to Jul 2026 and treats running systems in production as an engineering discipline: evaluation loops, tool boundaries, escalation paths, and cost control. The strongest adjacent threads are ai, llm, and infrastructure. Recurring title motifs include ai, production, engineering, and kubernetes.
Key claims
- The archive repeatedly argues that production practices only create leverage when they are wired into an existing workflow.
- Early posts lean on production and kubernetes, while newer posts lean on ai and production as constraints shifted.
- This topic repeatedly intersects with ai, llm, and infrastructure, so design choices here rarely stand alone.
Practical checklist
- Define quality gates up front: eval sets, guardrails, and explicit rollback criteria.
- Start with the newest post to calibrate current constraints, then backtrack to older entries for first principles.
- When boundary questions appear, cross-read ai and llm before committing implementation details.
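As a rough illustration of the first checklist item, explicit rollback criteria can be written down as data and checked mechanically. The sketch below is a minimal example under stated assumptions: the names `RollbackCriteria`, `ReleaseMetrics`, and `violations`, and every threshold value, are hypothetical and do not come from any post in the archive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackCriteria:
    """Hypothetical thresholds agreed on before launch; tune per feature."""
    min_eval_pass_rate: float = 0.95
    max_p95_latency_ms: float = 2000.0
    max_cost_per_request_usd: float = 0.02


@dataclass(frozen=True)
class ReleaseMetrics:
    """Observed values for a candidate release (assumed to be collected elsewhere)."""
    eval_pass_rate: float
    p95_latency_ms: float
    cost_per_request_usd: float


def violations(metrics: ReleaseMetrics, criteria: RollbackCriteria) -> list[str]:
    """Return the names of the gates the release fails, in a fixed order."""
    failed = []
    if metrics.eval_pass_rate < criteria.min_eval_pass_rate:
        failed.append("eval_pass_rate")
    if metrics.p95_latency_ms > criteria.max_p95_latency_ms:
        failed.append("p95_latency_ms")
    if metrics.cost_per_request_usd > criteria.max_cost_per_request_usd:
        failed.append("cost_per_request_usd")
    return failed


def should_roll_back(metrics: ReleaseMetrics, criteria: RollbackCriteria) -> bool:
    """A release is rolled back as soon as any single gate fails."""
    return bool(violations(metrics, criteria))
```

Keeping the thresholds in one frozen object means the rollback decision can be reviewed before launch rather than argued about during an incident.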
Failure modes
- Shipping agent behavior without hard boundaries for tools, data access, and approvals.
- Optimizing for model novelty while ignoring reliability, latency, or cost drift.
- Applying guidance from anywhere in the 2016 to 2026 span without revisiting its assumptions as context changed.
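The first failure mode above, agents without hard boundaries, can be made concrete with a small policy check that runs before any tool call. This is a sketch under assumptions, not the approach of any specific post: the tool names (`search_docs`, `issue_refund`), the `ToolPolicy` type, and the `authorize` function are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolPolicy:
    """Hypothetical per-tool boundary: which arguments are accepted and
    whether a human must approve the call before it runs."""
    allowed_args: frozenset
    requires_approval: bool = False


# Hypothetical allowlist; anything not listed here is denied by default.
POLICIES = {
    "search_docs": ToolPolicy(frozenset({"query"})),
    "issue_refund": ToolPolicy(frozenset({"order_id", "amount"}), requires_approval=True),
}


def authorize(tool: str, args: dict, approved: bool = False) -> str:
    """Treat the model's requested call as untrusted input and decide
    whether to allow it, hold it for approval, or deny it."""
    policy = POLICIES.get(tool)
    if policy is None:
        return "deny: unknown tool"
    extra = set(args) - policy.allowed_args
    if extra:
        return "deny: unexpected arguments " + ", ".join(sorted(extra))
    if policy.requires_approval and not approved:
        return "hold: needs human approval"
    return "allow"
```

The deny-by-default lookup is the design choice that matters here: a new tool has no effect until someone writes its policy.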
Suggested reading path
- Start here (current state): AI Engineering Is Its Own Discipline Now
- Then read (operating middle): Function Calling Patterns That Survive Production
- Finish with (foundational context): Docker in Production: What We Learned Running Containers at Dropbyke
Related posts
- AI Engineering Is Its Own Discipline Now
- AI Security: Evolving Threats and Defenses
- AI-Native Architecture Patterns 2026
- AI Video Applications in Practice
- AI Incidents Don’t Look Like Outages. That’s the Problem.
- AI Workflow Automation: Decisions Are Cheap, Actions Are Expensive
- AI Customer Support That Doesn’t Make People Hate You
- AI Security: Same Principles, New Attack Surface
References
27 posts
AI Production Governance: A Maturity Model
By mid-April 2026, the gap between teams shipping stable AI features and teams shipping chaos isn't tools; it's production governance. Here is how mature teams evaluate, deploy, and roll back.
AI Security: Evolving Threats and Defenses
As of late February 2026, AI security is defined by adaptive attacks and layered, operational defenses.
AI-Native Architecture Patterns 2026: Production Guide
Production AI architecture patterns for gateways, retrieval, evaluation, fallbacks, cost control, and ownership.
AI Video Applications in Practice
Video AI is practical for scoped workflows. This post covers what works, how to design for reliability, and where human review still matters.
AI Incidents Don't Look Like Outages. That's the Problem.
Your AI system can return 200 OK and still be wrong, unsafe, or confidently hallucinating. Here's how to detect, contain, and learn from AI incidents -- drawing from the same IR principles that work for traditional systems.
AI Workflow Automation: Decisions Are Cheap, Actions Are Expensive
The trick to AI workflow automation is simple: let the model decide, let deterministic code act, and never confuse the two.
AI Customer Support That Doesn't Make People Hate You
Most AI support systems are built to deflect tickets. The ones that actually work are built around escalation, grounding, and the simple idea that customers aren't idiots.
AI Security: Same Principles, New Attack Surface
AI systems are exposed APIs with real blast radius. The threats are injection, leakage, and tool misuse. The defenses are the same ones we've always needed -- just applied to a new surface.
Testing AI Where It Actually Runs
Offline evals are necessary but not sufficient. Here's how I test AI features in production with shadow mode, canaries, and rollback automation -- with Go code.
Your AI System Looks Healthy. It Is Not.
Traditional monitoring will tell you your AI service is up. It won't tell you it's returning confident garbage. Here's what observability actually looks like for AI.
Reasoning Models in Production: A Practical Guide
Reasoning models are powerful but expensive and slow. Here's how I integrate them in Go services with routing, async patterns, and cost controls that actually work.
Your AI Infrastructure Is Not Special
AI infrastructure at scale is just infrastructure. The same boring patterns -- gateways, caching, circuit breakers, budget enforcement -- solve the same boring problems.
AI Safety Is Just Production Engineering
AI safety in production isn't a research problem. It's defense in depth, the same way cyber defense works -- layered controls, assumed breach, observable boundaries.
Function Calling Patterns That Survive Production
Function calling is how LLMs touch real systems. Treat tools like APIs, arguments like untrusted input, and permissions like the model is an intern with root access.
Agentic Workflows: From Demo Magic to Production Reality
AI agents that can take actions are fundamentally different from chatbots. The engineering bar must match the blast radius.
Why I Run Multiple Models in Production
Betting on a single model provider is like having a single database with no failover. Here is why multi-model is the only sane production strategy.
AI Engineering Is Its Own Discipline Now
AI engineering is not ML research with a product hat. It is the discipline of making models behave in production -- and it demands its own skill set.
LLM Observability: Your Existing Monitoring Is Not Enough
Traditional monitoring tells you the service is up. It doesn't tell you the model started confidently returning garbage last Tuesday. Here's how to actually observe LLM systems.
AI in Production Is Just Engineering. Treat It That Way.
ChatGPT changed expectations overnight, but shipping AI features that actually work is an engineering problem, not a model problem.
Your Staging Environment Is Lying to You
Staging never catches the real bugs. Here's how I learned to test in production without burning everything down.
The Boring Kubernetes Checklist That Actually Keeps Production Alive
Most Kubernetes outages come from skipping the basics. Here's the checklist I use after running clusters at the fintech startup and now at Decloud.
GraphQL in Production Is Harder Than They Tell You
After a year running GraphQL at the fintech startup, here's what the conference talks leave out.
Two Years of Kubernetes in Production — The Boring Parts Are the Hard Parts
Year two of running Kubernetes at the fintech startup. The panic is gone. Now it's networking, resource tuning, and all the operational grunt work nobody blogs about.
A Year Running Kubernetes in Production — What Actually Happened
After a year of running Kubernetes in production, the wins are real but the sharp edges drew blood first. Here's what paid off, what bit us, and what I'd do differently.
Why We Deleted 42 Grafana Panels
Most teams monitor too much and alert on the wrong things. Five metrics are enough to run a startup backend.
Building Resilient Systems: Lessons from Production Failures
Production incidents show where architecture bends and where it breaks. These lessons focus on designing for failure, limiting blast radius, and making recovery routine.
Docker in Production: What We Learned Running Containers at Dropbyke
Running Docker in production at Dropbyke forced us to get serious about image builds, container networking, log aggregation, and security. Here is what actually worked.