AI Security: Evolving Threats and Defenses
As of late February 2026, AI security is defined by adaptive attacks and layered, operational defenses.
This archive spans 27 posts from Feb 2016 to Jul 2026 and treats AI security as a production discipline: evaluation loops, tool boundaries, escalation paths, and cost control. The strongest adjacent threads are ai, llm, and infrastructure. Recurring title motifs include ai, production, engineering, and kubernetes.
As of late January 2026, AI-native architecture is a stable discipline with repeatable patterns for delivery, safety, and change management.
Video AI is practical for scoped workflows. This post covers what works, how to design for reliability, and where human review still matters.
Your AI system can return 200 OK and still be wrong, unsafe, or confidently hallucinating. Here's how to detect, contain, and learn from AI incidents -- drawing from the same IR principles that work for traditional systems.
The trick to AI workflow automation is simple: let the model decide, let deterministic code act, and never confuse the two.
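The decide/act split can be sketched in a few lines of Go. This is a minimal illustration, not the post's implementation; the names (`Decision`, `Execute`, `allowedActions`) and the two sample actions are assumptions made up for the example. The point is that the model only ever proposes an action from a fixed menu, and deterministic code validates and executes it.

```go
package main

import (
	"errors"
	"fmt"
)

// Decision is what the model returns: a proposed action, never an executed one.
// (Illustrative type; not from the post.)
type Decision struct {
	Action string
	Target string
}

// allowedActions maps action names to deterministic handlers. The model can
// only pick from this menu; it cannot invent new actions.
var allowedActions = map[string]func(target string) (string, error){
	"refund":   func(t string) (string, error) { return "refund issued for " + t, nil },
	"escalate": func(t string) (string, error) { return "escalated " + t, nil },
}

// Execute is the deterministic side: validate the decision, then act.
func Execute(d Decision) (string, error) {
	handler, ok := allowedActions[d.Action]
	if !ok {
		return "", errors.New("action not in allowlist: " + d.Action)
	}
	if d.Target == "" {
		return "", errors.New("missing target")
	}
	return handler(d.Target)
}

func main() {
	out, err := Execute(Decision{Action: "refund", Target: "order-42"})
	fmt.Println(out, err)
	_, err = Execute(Decision{Action: "drop_tables", Target: "prod"})
	fmt.Println(err) // the model asked for something outside the menu; nothing ran
}
```

The allowlist is the boundary: everything inside it is code you wrote and tested, and the model's influence stops at choosing among entries.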
Most AI support systems are built to deflect tickets. The ones that actually work are built around escalation, grounding, and the simple idea that customers aren't idiots.
AI systems are exposed APIs with real blast radius. The threats are injection, leakage, and tool misuse. The defenses are the same ones we've always needed -- just applied to a new surface.
Offline evals are necessary but not sufficient. Here's how I test AI features in production with shadow mode, canaries, and rollback automation -- with Go code.
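Shadow mode is the least risky of those three, and its core is small enough to sketch here. This is an assumed shape, not the post's code: `ShadowServe` runs a candidate model in parallel with the primary, records any disagreement for offline review, and only ever returns the primary's answer to the caller.

```go
package main

import (
	"fmt"
	"sync"
)

// Handler stands in for any model call. (Illustrative; not from the post.)
type Handler func(input string) string

// ShadowServe serves the primary's answer while running the candidate in
// parallel and recording mismatches for offline comparison.
func ShadowServe(primary, candidate Handler, input string, record func(primaryOut, candidateOut string)) string {
	var candidateOut string
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		candidateOut = candidate(input)
	}()
	primaryOut := primary(input)
	wg.Wait()
	if primaryOut != candidateOut {
		record(primaryOut, candidateOut)
	}
	return primaryOut // user traffic only ever sees the primary
}

func main() {
	mismatches := 0
	out := ShadowServe(
		func(s string) string { return "v1:" + s },
		func(s string) string { return "v2:" + s },
		"hello",
		func(a, b string) { mismatches++ },
	)
	fmt.Println(out, mismatches)
}
```

In a real service the `record` callback would write to a log or metrics pipeline, and exact string equality would be replaced by a task-appropriate comparison, but the safety property is the same: the candidate can never affect what users see.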
Traditional monitoring will tell you your AI service is up. It won't tell you it's returning confident garbage. Here's what observability actually looks like for AI.
Reasoning models are powerful but expensive and slow. Here's how I integrate them in Go services with routing, async patterns, and cost controls that actually work.
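A routing decision like that can be reduced to a small pure function. This sketch is an assumption about the shape of such a router, not the post's policy: the tier names, the token threshold, and the flat per-call cost are all invented for illustration.

```go
package main

import "fmt"

// Route picks a model tier from request traits and remaining budget.
// Tier names, the 4000-token threshold, and the assumed 50-cent worst-case
// cost of a reasoning call are illustrative, not the post's actual numbers.
func Route(promptTokens int, needsReasoning bool, budgetCentsLeft int) string {
	const reasoningCostCents = 50
	if needsReasoning && budgetCentsLeft >= reasoningCostCents {
		return "reasoning-model"
	}
	// Budget exhausted or no reasoning needed: degrade to cheaper tiers.
	if promptTokens > 4000 {
		return "large-context-model"
	}
	return "fast-cheap-model"
}

func main() {
	fmt.Println(Route(500, true, 100))  // budget allows the expensive tier
	fmt.Println(Route(500, true, 10))   // degrades rather than overspending
	fmt.Println(Route(8000, false, 100))
}
```

Keeping the router a pure function of its inputs makes the policy trivially unit-testable, which matters once routing decisions start carrying real cost.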
AI infrastructure at scale is just infrastructure. The same boring patterns -- gateways, caching, circuit breakers, budget enforcement -- solve the same boring problems.
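Budget enforcement is the least glamorous item on that list and also the easiest to sketch. The cents-based accounting below is an assumption for illustration, not the post's implementation: a mutex-guarded counter that rejects a call before it spends money, rather than discovering the overrun on the invoice.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Budget is a simple spend limiter. (Illustrative; real systems would
// persist spend and reconcile against provider billing.)
type Budget struct {
	mu         sync.Mutex
	limitCents int
	spentCents int
}

// Reserve admits a call only if its worst-case cost still fits the limit.
func (b *Budget) Reserve(costCents int) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.spentCents+costCents > b.limitCents {
		return errors.New("budget exceeded")
	}
	b.spentCents += costCents
	return nil
}

func main() {
	b := &Budget{limitCents: 100}
	fmt.Println(b.Reserve(60)) // fits
	fmt.Println(b.Reserve(60)) // would exceed the limit; rejected up front
}
```

The same check-before-spend shape generalizes: reserve against worst-case cost, then release the unused portion once the actual token count is known.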
AI safety in production isn't a research problem. It's defense in depth, the same way cyber defense works -- layered controls, assumed breach, observable boundaries.
Function calling is how LLMs touch real systems. Treat tools like APIs, arguments like untrusted input, and permissions like the model is an intern with root access.
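Treating arguments as untrusted input looks exactly like validating a request body. This is a hedged sketch of that idea in Go; the tool (`DeleteFileArgs`), the sandbox prefix, and the validator name are all hypothetical. Strict decoding rejects fields the schema doesn't define, and domain checks run before the tool ever executes.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// DeleteFileArgs is a hypothetical tool's argument schema.
type DeleteFileArgs struct {
	Path string `json:"path"`
}

// ValidateDeleteFile treats model-supplied JSON like any untrusted request
// body: strict decode, then domain checks, before the tool runs.
func ValidateDeleteFile(raw []byte) (DeleteFileArgs, error) {
	var args DeleteFileArgs
	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.DisallowUnknownFields() // reject arguments the schema doesn't define
	if err := dec.Decode(&args); err != nil {
		return args, err
	}
	if strings.Contains(args.Path, "..") || !strings.HasPrefix(args.Path, "/tmp/scratch/") {
		return args, errors.New("path outside permitted directory")
	}
	return args, nil
}

func main() {
	_, err := ValidateDeleteFile([]byte(`{"path":"/etc/passwd"}`))
	fmt.Println(err) // rejected before any filesystem call
	ok, err := ValidateDeleteFile([]byte(`{"path":"/tmp/scratch/report.txt"}`))
	fmt.Println(ok.Path, err)
}
```

The intern-with-root-access framing falls out naturally: the model can ask for anything, but the validator only grants what a narrowly scoped role would get.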
AI agents that can take actions are fundamentally different from chatbots. The engineering bar must match the blast radius.
Betting on a single model provider is like having a single database with no failover. Here is why multi-model is the only sane production strategy.
AI engineering is not ML research with a product hat. It is the discipline of making models behave in production -- and it demands its own skill set.
Traditional monitoring tells you the service is up. It doesn't tell you the model started confidently returning garbage last Tuesday. Here's how to actually observe LLM systems.
ChatGPT changed expectations overnight, but shipping AI features that actually work is an engineering problem, not a model problem.
Staging never catches the real bugs. Here's how I learned to test in production without burning everything down.
Most Kubernetes outages come from skipping the basics. Here's the checklist I use after running clusters at the fintech startup and now at Decloud.
After a year running GraphQL at the fintech startup, here's what the conference talks leave out.
Year two of running Kubernetes at the fintech startup. The panic is gone. Now it's networking, resource tuning, and all the operational grunt work nobody blogs about.
After a year of running Kubernetes in production, the wins are real but the sharp edges drew blood first. Here's what paid off, what bit us, and what I'd do differently.
Most teams monitor too much and alert on the wrong things. Five metrics are enough to run a startup backend.
Production incidents show where architecture bends and where it breaks. These lessons focus on designing for failure, limiting blast radius, and making recovery routine.
Running Docker in production at Dropbyke forced us to get serious about image builds, container networking, log aggregation, and security. Here is what actually worked.