Looking back at 2024, the word that keeps coming to mind is “normalization.” AI stopped being the shiny thing leadership wanted to announce and became the thing teams had to maintain. That shift changed everything about how I spent my year.
The Work
Most of my 2024 was hands-on. Telecom, food delivery, real-time communications, fintech – different industries and scales, but the same fundamental questions. How do we go from demo to production? How do we control costs? How do we measure whether this actually works?
The conversations changed dramatically between January and December. Early in the year, the question was what AI could do. By mid-year, it was what AI should do – which tasks justified the cost, the complexity, and the risk. By Q4, the focus had shifted to operations: monitoring, evaluation cadence, cost attribution, team structure.
That progression felt right, like an industry growing up.
What Held Up
A few things I believed in January that held up through December:
Narrow scope wins. Every successful deployment I saw this year started with a tightly scoped use case. “Classify these support tickets into five categories” beats “build an AI assistant for customer service” every time. The narrow scope forces clear success criteria, which forces real evaluation, which forces real accountability.
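Narrow scope also pays off mechanically: with five fixed categories, the output can be validated in code. A minimal sketch of the idea – the category names and the model stub are made up for illustration:

```python
# Five fixed categories (hypothetical) mean the model's answer can be
# checked mechanically instead of trusted blindly.
ALLOWED = {"billing", "login", "bug", "feature_request", "other"}

def classify(ticket: str, call_model) -> str:
    """Ask for exactly one of the allowed labels; fall back to 'other'."""
    prompt = (
        "Classify this support ticket into exactly one category: "
        + ", ".join(sorted(ALLOWED))
        + ".\nTicket: " + ticket
        + "\nAnswer with the category name only."
    )
    label = call_model(prompt).strip().lower()
    return label if label in ALLOWED else "other"

# Stub model for illustration; in production this wraps an LLM API call.
print(classify("I can't sign in", lambda p: "login"))  # -> login
print(classify("???", lambda p: "something weird"))    # -> other
```

The fallback branch is the point: a vague "AI assistant" has no equivalent of "reject anything outside these five labels."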
Evaluation is the product. Teams that built evaluation harnesses early shipped faster and with more confidence. Teams that skipped evaluation shipped demos that never became products. I’ll keep saying it.
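An evaluation harness doesn't have to be elaborate to be useful. A toy sketch, with stand-in examples and a stand-in system under test:

```python
def run_eval(system, examples):
    """examples: list of (input, expected). Returns (accuracy, failures)."""
    failures = [
        (inp, expected, system(inp))
        for inp, expected in examples
        if system(inp) != expected
    ]
    return 1 - len(failures) / len(examples), failures

# Hypothetical labeled examples; real sets come from domain experts.
examples = [("refund please", "billing"), ("app crashes on launch", "bug")]
acc, fails = run_eval(
    lambda x: "billing" if "refund" in x else "bug",  # system under test
    examples,
)
print(f"accuracy={acc:.0%}")  # -> accuracy=100%
```

The value isn't the loop; it's the labeled examples and the habit of running them on every change.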
Retrieval quality determines answer quality. I built multiple RAG systems this year. In every single case, the initial complaint was “the model hallucinates” and the actual fix was improving retrieval. Better chunking. Hybrid search. Reranking. The model was fine. The evidence was bad.
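The hybrid-search part of that fix is conceptually simple: blend a lexical score with a vector score and rank on the blend. A toy sketch with stand-in scoring functions (a real system would use BM25 and embedding similarity):

```python
def lexical_score(query: str, doc: str) -> float:
    """Toy keyword overlap; stands in for BM25."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def hybrid_search(query, docs, vec_score, alpha=0.5):
    """Rank docs by a weighted blend of lexical and vector scores."""
    scored = [
        (alpha * lexical_score(query, d) + (1 - alpha) * vec_score(query, d), d)
        for d in docs
    ]
    return [d for _, d in sorted(scored, reverse=True)]
```

Reranking is the same move one level up: take the top results from this stage and re-score them with a stronger (slower) model before handing evidence to the LLM.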
Cost control is a day-one concern. I watched one team’s AI bill go from manageable to alarming in six weeks because nobody was tracking per-feature attribution. By the time they noticed, the organizational habit of ignoring cost was already baked in. Much harder to fix after the fact.
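Per-feature attribution doesn't need to be elaborate; it needs to exist from day one. A sketch of the idea, with hypothetical token prices:

```python
from collections import defaultdict

# Illustrative rates only; use your provider's actual pricing.
PRICE_PER_1K = {"input": 0.003, "output": 0.015}

class CostTracker:
    """Tag every model call with the feature that triggered it."""

    def __init__(self):
        self.by_feature = defaultdict(float)

    def record(self, feature: str, input_tokens: int, output_tokens: int) -> float:
        cost = (
            input_tokens * PRICE_PER_1K["input"]
            + output_tokens * PRICE_PER_1K["output"]
        ) / 1000
        self.by_feature[feature] += cost
        return cost

tracker = CostTracker()
tracker.record("ticket_classifier", 800, 20)
tracker.record("summarizer", 4000, 600)
print(max(tracker.by_feature, key=tracker.by_feature.get))  # -> summarizer
```

With this in place, "the bill is alarming" becomes "the summarizer is alarming" – a question a team can actually act on.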
What Surprised Me
Claude 3.5 Sonnet changed my default recommendation. For most of the year I was recommending different models for different tasks with complex routing logic. By late 2024, Claude 3.5 Sonnet had become my default “just start here” answer for a wide range of production tasks. The quality-to-cost ratio was hard to beat. I still recommend routing for cost optimization, but the bar for when routing matters got higher.
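When routing does still pay for itself, it rarely needs to be complex. A sketch of the pattern – the model names and the heuristic are made up:

```python
def route(task: str) -> str:
    """Send easy tasks to a cheap default; escalate only when flagged."""
    looks_hard = len(task) > 500 or "analyze" in task.lower()
    return "frontier-model" if looks_hard else "default-model"
```

The design choice is the default direction: start cheap and escalate on a signal, rather than starting expensive and trying to downgrade.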
Open models got good enough to matter. Llama 3 and Mistral variants crossed a threshold this year. Not for everything – frontier tasks still need frontier models. But for classification, extraction, and structured output, open models running on modest hardware became a real option. I helped two teams set up self-hosted deployments where the economics made sense.
Teams overbuilt. This one surprised me less than it should have. Multiple teams built multi-agent orchestration systems for tasks that should have been a single prompt with a good system message. The complexity wasn’t justified by the task. It was justified by enthusiasm. I spent a fair amount of Q3 and Q4 helping teams simplify.
What Stayed Hard
Evaluation is hard. I keep preaching it, and I keep watching teams struggle with it. Building a good eval set requires domain expertise, clear criteria, and the willingness to maintain it over time. Most teams get the first version right, then let it rot. Evaluation sets need the same care as test suites.
Multi-step workflows remained fragile. Agents that need to plan, execute, observe, and adapt are architecturally interesting and operationally painful. The tooling improved this year but the fundamental challenge – maintaining coherence over many steps – is still unsolved. The teams that succeeded constrained the number of steps aggressively.
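Constraining steps can be enforced in the loop itself rather than left to the agent's judgment. A sketch, with illustrative names:

```python
def run_agent(step_fn, state, max_steps=5):
    """step_fn(state) -> (new_state, done). Halt hard at max_steps:
    a loud failure beats an agent quietly drifting off-task."""
    for _ in range(max_steps):
        state, done = step_fn(state)
        if done:
            return state
    raise RuntimeError(f"agent exceeded {max_steps} steps; aborting")
```

The cap doubles as an evaluation signal: if a workflow routinely hits it, the task was scoped too broadly for the agent, not the cap set too low.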
Hiring remained weird. The “AI engineer” role is still not well-defined. Every company means something different by it. The best hires I saw were strong software engineers who learned the AI-specific parts on the job, not ML researchers who struggled with production engineering.
The Personal Angle
I’m still contributing to Go. Still building tools. The work is rewarding but I miss building full-time sometimes. There’s a different satisfaction in shipping code versus reviewing architecture diagrams.
The problem space – helping teams build faster and ship reliably – feels increasingly important as AI lowers the barrier to starting projects but does nothing to lower the barrier to finishing them. Starting is easy. Shipping is hard. That gap is where I keep ending up.
The Takeaway
2024 was the year AI got boring. I mean that as the highest compliment. Boring means production-ready. Boring means maintainable. Boring means teams can build on top of it without wondering if the foundation will shift next month.
The demo phase is over. The real work is underway. And the teams that win from here are the ones that treat AI for what it is: another production system that needs discipline, measurement, and ownership.
Same as everything else.