Testing AI Where It Actually Runs
Offline evals are necessary but not sufficient. Here's how I test AI features in production using shadow mode, canaries, and rollback automation, with Go code throughout.
This archive collects 8 posts on testing, from Aug 2017 to Apr 2025, with an emphasis on practical engineering craft: interfaces, test strategy, and maintainable implementation details. The most closely related topic threads are ai, quality, and go; recurring themes in the titles include testing, LLMs, lying, and AI.
LLM outputs are non-deterministic. That doesn't mean you can't test them rigorously. Here's the layered testing approach I use in production.
Your LLM feature looks great in demos and breaks in production. Here's how to build an evaluation loop that catches regressions before your users do.
Microservices fail at the seams. A layered test strategy that keeps feedback fast and catches integration issues before production.
I tested Terraform modules with unit checks, policy engines, and full integration runs side by side. Here's what each approach actually catches and what it misses.
Most load tests produce comforting numbers instead of useful answers. Here's what I learned the hard way about getting honest results.
Staging never catches the real bugs. Here's how I learned to test in production without burning everything down.
Chaos engineering isn't just for the big players. Here's how a small team can start breaking things deliberately and actually learn from it.