Last quarter I helped a team migrate a large Go codebase from an internal HTTP framework to standard library patterns: around 200K lines across 40+ services. It was the kind of project where you know the end state, you know the transformation rules, and the work is 90% mechanical and 10% judgment calls that keep you up at night.
We used LLMs to handle the mechanical 90%. It worked. But “it worked” comes with enough caveats that it’s worth being honest about what actually happened.
What the AI was good at
Pattern matching and consistent transformation are the sweet spot. We had about 15 distinct patterns that needed to change: custom route handlers to standard ones, middleware signatures, and error response formats. For each pattern, we wrote a clear transformation rule with before/after examples.
The LLM could take a file, identify which patterns were present, and produce a transformed version. For straightforward cases, it was faster than any human and more consistent. It didn’t get bored on file 200. It didn’t introduce typos. It applied the same transformation rule the same way every time.
We processed about 300 files in two days that would have taken two engineers a couple of weeks. The mechanical savings were real.
What the AI was bad at
Judgment. The 10% of cases that didn’t fit neatly into the transformation rules required understanding intent, not just pattern matching: a handler that looked standard but had a subtle side effect; a middleware chained in an unusual order for a specific reason; error handling intentionally different from the standard pattern because of a business rule documented nowhere except a Slack thread from 2021.
The LLM would happily transform these cases using the standard rules. The output would compile. The tests would pass. And the behavior would be subtly wrong in ways that only surfaced under specific conditions.
This is the dangerous part. AI-generated code that’s almost right is harder to catch than code that’s obviously wrong. It passes automated checks and casual review. Then you find the bug three weeks later when a customer reports something weird.
The workflow that worked
Here’s what we settled on after the first batch of surprises:
Step 1: Scope with samples. Don’t start with “migrate everything.” Pick 10 representative files that cover the range of patterns. Run them through the LLM. Review the output manually. This reveals the transformation rules you need and the edge cases you’ll need to handle differently.
Step 2: One rule per pattern. Write each transformation rule explicitly. Not “update the HTTP handlers,” but “replace framework.Handler(func(ctx *Ctx) error {...}) with http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {...}) and move error handling to…” The more specific the rule, the better the LLM follows it.
Step 3: Small batches, continuous validation. We processed 10-20 files at a time. After each batch: run the build, run the tests, run the linter, and do a quick diff review. If something broke, fix it and update the transformation rule before continuing. Don’t accumulate 200 files of changes and then try to debug a test failure.
Step 4: Flag the hard ones. When the LLM produced a transformation that looked different from the standard pattern, we flagged it for human review instead of forcing it through. About 15% of files got flagged. Those were the ones where the AI saved us no time at all – but catching them early spared us a lot of pain later.

Treat AI output as draft code
This is the principle that made the whole process work. Every AI-generated change went through the same review process as a human-written change. Same CI checks. Same code review. Same approval workflow.
The temptation is to trust the AI more because it’s consistent and fast. Resist that temptation. The AI is a junior engineer who types incredibly fast and never pushes back on your instructions. That’s useful. It isn’t the same as reliable.
What I’d do differently
I’d build the evaluation harness first. We started the migration, then realized we didn’t have a good way to verify that migrated services behaved identically to the originals. We retrofitted integration tests, but it would have been faster to invest that time upfront.
I’d also version the transformation rules alongside the code. We iterated on the rules as we discovered edge cases, but we didn’t track which version of the rules produced which batch of changes. When we found a bug, tracing it back to the specific rule version that caused it was harder than it should have been.
The honest summary
AI made a two-month migration take three weeks. That’s a genuine win. But it didn’t change the nature of the hard parts. Scoping, validation, edge case handling, and human judgment on ambiguous cases – those are still the bottleneck. The AI accelerated the parts that were already straightforward.
Use AI for migrations. Just don’t pretend it replaces the discipline that makes migrations safe.