I wrote about the true cost of technical debt back in 2016. The core argument was simple: if you can’t put a number on your debt, you can’t make a rational decision about it. Measure the pain, do the math, and present the tradeoff.
That advice still holds. But AI debt is a different animal, and it’s making me angry.
With traditional tech debt, at least you can see it. Messy code. Missing tests. A module everyone dreads touching. The debt is in the codebase. You can grep for it. You can point to it in a PR review.
AI debt hides. It hides in prompts copy-pasted from a demo and never documented. In evaluations that were “planned for next sprint” six months ago. In embeddings that went stale when source docs changed and nobody re-indexed. In retrieval pipelines where data drifted so gradually that answers went from “good” to “plausible” to “confidently wrong,” and nobody noticed until a customer complained.
The system is still up. It still returns 200 OK. And it’s slowly poisoning your product.
The four kinds of AI debt that keep showing up
Prompt debt. Someone wrote a prompt that worked. They shipped it. Three model versions later, it still “works,” but the behavior has shifted in ways nobody documented because nobody was measuring. The prompt has magic strings nobody can explain. Changing a single sentence now requires a full regression test nobody has time for, so nobody changes anything, and the prompt becomes legacy code that happens to be written in English.
Eval debt. This one drives me up the wall. Teams ship AI features with no evaluation suite. Then they argue about quality using anecdotes. “It seemed fine when I tried it.” That’s not engineering; that’s vibes. Without evals, you can’t tell if your last change made things better or worse. You’re flying blind and calling it agile.
Data and pipeline debt. Stale embeddings. Missing documents. Labeling standards that drifted. The retrieval layer quietly degrades, and because LLMs are so good at sounding confident, nobody notices that answers are getting worse. This is the most insidious form because it’s silent. The system doesn’t crash. It just gets less trustworthy.
Architecture debt. The model interface is hard-coded three layers deep. Tool calls are embedded in application logic. Swapping a provider or upgrading a model feels like open-heart surgery. So teams avoid improvements entirely. The system calcifies.
How to actually fix this
The same way you fix any tech debt. Not with a heroic rewrite. With discipline.
Version your prompts like code. Put them in the repo. Give them owners. Document the intent, not just the text. When someone changes a prompt, they should write down why, and what eval signals should remain stable. This isn’t bureaucracy. It’s how you stop mystery regressions.
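One way to make that concrete is to treat each prompt as a small, versioned object in the repo. This is a sketch, not a prescription; the fields and names here (VersionedPrompt, owner, intent) are illustrative, not from any particular tool.

```python
# Sketch of prompt-as-code: the prompt lives in the repo next to its
# metadata, so a change to the text is a reviewable diff with an owner
# and a stated intent attached.
from dataclasses import dataclass

@dataclass(frozen=True)
class VersionedPrompt:
    name: str
    version: int
    owner: str      # who to ask when this prompt's behavior shifts
    intent: str     # what the prompt is supposed to achieve, not just its text
    template: str   # the actual prompt text, with {placeholders}

SUMMARIZE_TICKET = VersionedPrompt(
    name="summarize_ticket",
    version=3,
    owner="ai-platform",
    intent="Produce a two-sentence summary a support agent can skim.",
    template="Summarize the following ticket in two sentences:\n\n{ticket_text}",
)

def render(prompt: VersionedPrompt, **kwargs: str) -> str:
    """Fill the template; raises KeyError if a placeholder is missing."""
    return prompt.template.format(**kwargs)
```

The point isn't the dataclass; it's that a prompt change now shows up in review with its rationale, instead of living in a dashboard nobody diffs.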
Build evals before you ship. Start with a small set of real examples and documented expected outcomes. Run them on every meaningful change. It doesn’t need to be elaborate. It needs to be consistent. Teams that do this – even just 20-30 test cases – move faster because they know what is safe to change.
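A minimal harness for that can be a few dozen lines. In this sketch, call_model is a hypothetical stand-in for whatever invokes your AI feature, and the cases check properties rather than exact strings, so harmless wording changes don't produce false failures.

```python
# Minimal eval harness sketch: a small set of real examples with
# checkable properties, run on every meaningful change.
from typing import Callable

EVAL_CASES = [
    # (input, check) pairs; checks assert properties, not exact output.
    ("I want a refund for order #1234", lambda out: "refund" in out.lower()),
    ("What is your uptime guarantee?", lambda out: "uptime" in out.lower()),
]

def run_evals(call_model: Callable[[str], str]) -> float:
    """Return the pass rate; gate releases on this number dropping."""
    passed = sum(1 for inp, check in EVAL_CASES if check(call_model(inp)))
    return passed / len(EVAL_CASES)
```

Even this crude a harness gives you a baseline number, which is exactly what's missing when the quality argument is "it seemed fine when I tried it."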
Decouple the model interface. Abstract it. Separate retrieval from response logic. That lets you swap providers, test with mocks, and upgrade models without touching core flows. It also makes your system testable, which is the whole point.
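In Python, one light way to draw that seam is a Protocol that core flows depend on instead of a provider SDK. The names below (ModelClient, MockModel, answer_question) are assumptions for the sketch, not a real library's API.

```python
# Sketch of a decoupled model interface: application logic depends on
# this Protocol, so providers can be swapped and tests can use a mock.
from typing import Protocol

class ModelClient(Protocol):
    def complete(self, prompt: str) -> str: ...

class MockModel:
    """Deterministic stand-in so core flows are testable without a provider."""
    def complete(self, prompt: str) -> str:
        return f"[mock response to {len(prompt)} chars]"

def answer_question(question: str, context: str, model: ModelClient) -> str:
    # Retrieval happened upstream; this function only assembles and calls.
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return model.complete(prompt)
```

A real provider adapter implements the same one-method interface, so upgrading a model becomes a change to one adapter rather than open-heart surgery on application logic.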
Monitor freshness alongside quality. Track when your embeddings were last updated. Track retrieval relevance scores. If your data pipeline is stale, your outputs are stale, no matter how good the model is.
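The embedding-freshness half of that can be a simple scheduled check: compare when each source document last changed against when it was last indexed, and flag the laggards. The data shapes here are assumptions; in practice these timestamps come from your document store and vector index.

```python
# Freshness check sketch: find documents whose source changed well after
# their embeddings were last indexed, or that were never indexed at all.
from datetime import datetime, timedelta

def stale_documents(index_times: dict[str, datetime],
                    source_times: dict[str, datetime],
                    max_lag: timedelta = timedelta(days=7)) -> list[str]:
    """Doc IDs whose embeddings lag their source by more than max_lag."""
    stale = []
    for doc_id, changed_at in source_times.items():
        indexed_at = index_times.get(doc_id)
        if indexed_at is None or changed_at - indexed_at > max_lag:
            stale.append(doc_id)
    return stale
```

Wire the output into the same alerting you use for quality signals: a growing stale list is an early warning that answers are about to drift, even though the system still returns 200 OK.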
The uncomfortable part
Most teams accumulate AI debt because they shipped under pressure and told themselves they’d clean it up later. I’ve been guilty of this. Early on at a startup I ran, we had prompts that worked “well enough” and no eval suite for weeks. The reckoning came when we swapped model versions and spent three days figuring out what broke because we had no baseline to compare against.
The fix isn’t a cleanup sprint. It’s a steady cadence. Fifteen percent of capacity toward debt work, same as I recommended in 2016. Review prompt changes with rationale. Run evals on every release. Monitor quality signals and data freshness together.
AI debt is manageable. But it requires intention. If every small change to your AI system feels risky, you already have a debt problem. The path forward isn’t heroic rewrites. It’s a steady sequence of small, documented improvements.
Steady beats dramatic. Every time.