I’m going to say something that will annoy AI tooling vendors: most AI code review output is garbage.
Not all of it. Maybe 20% is genuinely useful. But the other 80% is vague, style-obsessed, context-free commentary that would get a human reviewer told to try harder. “Consider adding error handling here.” Thanks. I hadn’t considered that. In Go. Where every third line is error handling.
I’ve been running AI review on PRs across production codebases for months. I wanted it to work. I really did. A tireless reviewer that catches logic bugs and security issues while humans focus on architecture and design? Sign me up. The reality is more complicated.
What it actually catches
When AI code review works, it works well. The wins are real:
Logic errors on changed paths. The model is good at spotting off-by-one errors, nil pointer risks, and missing edge cases in the specific lines that changed. It caught a race condition in a Go channel handler that three human reviewers missed. That alone justified the experiment.
Security surface area. SQL injection in a new endpoint. Hardcoded credentials in a test file that was about to be committed. An overly permissive CORS config. These are pattern-matching tasks, and models are decent at pattern matching.
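The injection case is easy to show. A minimal sqlite3 sketch (table and names are mine, purely illustrative): the first function is the pattern a model flags reliably, the second is the parameterized fix it typically suggests.

```python
import sqlite3

def find_user_unsafe(conn: sqlite3.Connection, username: str):
    # Vulnerable: user input interpolated directly into the SQL string.
    # A username like "x' OR '1'='1" changes the query's meaning.
    return conn.execute(
        f"SELECT id FROM users WHERE name = '{username}'"
    ).fetchone()

def find_user_safe(conn: sqlite3.Connection, username: str):
    # Fix: parameterized query; the driver handles quoting and escaping.
    return conn.execute(
        "SELECT id FROM users WHERE name = ?", (username,)
    ).fetchone()
```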
Copy-paste bugs. Someone copies a function, changes three of four parameters, and forgets the fourth. The model catches this reliably. Humans miss it because we read what we expect to see.
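A contrived Python sketch of the pattern (the names are mine, not from any real codebase): the check was updated in the copy, but the error message was not.

```python
def validate_start(event: dict) -> None:
    # Original validator.
    if event.get("start") is None:
        raise ValueError("start is required")

def validate_end(event: dict) -> None:
    # Copied from validate_start. The key in the condition was changed,
    # but the message was not -- the forgotten fourth edit. A human skims
    # right past it; a diff-focused reviewer flags the mismatch.
    if event.get("end") is None:
        raise ValueError("start is required")  # BUG: should say "end is required"
```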
Where it falls apart
Business context. The model doesn’t know why your checkout flow has that weird retry logic. It doesn’t know that the “redundant” nil check exists because a specific vendor API lies about its response types. It doesn’t know your system’s history. So it flags things that aren’t problems and misses things that are.
Large diffs. Anything over a few hundred lines and the model loses the thread. It starts making generic observations instead of specific findings. “This function is complex and could benefit from refactoring.” Really helpful on a 2,000-line migration PR.
Style opinions nobody asked for. “Consider using a more descriptive variable name.” “This comment could be more detailed.” “Consider extracting this into a separate function.” If I wanted a style cop, I’d configure a linter. AI review should find bugs, not police style.
How I actually use it
After months of tuning, here’s what works.
Scope it to the diff. Don’t let the model browse the entire repo. Give it the changed lines and maybe the immediate surrounding context. The more you feed it, the more generic the output gets.
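One way to enforce that scope mechanically, as a sketch: take the output of something like git diff -U3 as a string, split it into hunks, and drop oversized ones before they ever reach the model. The 200-line cutoff is my own threshold, not a magic number.

```python
def diff_hunks(diff_text: str, max_hunk_lines: int = 200) -> list[str]:
    """Split a unified diff into per-hunk chunks.

    Hunks larger than max_hunk_lines are dropped: past that size the
    model stops making specific findings anyway.
    """
    hunks: list[str] = []
    current: list[str] = []
    for line in diff_text.splitlines():
        if line.startswith("@@"):          # start of a new hunk
            if current:
                hunks.append("\n".join(current))
            current = [line]
        elif current:                      # body lines of the current hunk
            current.append(line)
    if current:
        hunks.append("\n".join(current))
    return [h for h in hunks if len(h.splitlines()) <= max_hunk_lines]
```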
Demand specifics. My review prompt is aggressive about this:
Review this diff. For each finding:
- Exact line number
- Severity: critical / warning / info
- What could fail at runtime
- A concrete fix
Skip style suggestions. Skip anything a linter would catch.
If nothing is wrong, say nothing.
That last line matters. Without it, the model will always find something to say because it’s trained to be helpful. Sometimes the most helpful thing is silence.
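Findings in that shape are also easy to consume mechanically. A sketch, assuming you additionally instruct the model to emit one pipe-delimited finding per line (the delimiter is my convention, not part of the prompt): anything that doesn't match the format is ignored, and silence parses to an empty list.

```python
import re
from dataclasses import dataclass

@dataclass
class Finding:
    line: int
    severity: str  # critical / warning / info
    failure: str   # what could fail at runtime
    fix: str       # the concrete fix

# Assumed output format:  <line> | <severity> | <what could fail> | <fix>
FINDING_RE = re.compile(
    r"^\s*(\d+)\s*\|\s*(critical|warning|info)\s*\|\s*(.+?)\s*\|\s*(.+)$"
)

def parse_findings(text: str) -> list[Finding]:
    findings = []
    for raw in text.splitlines():
        m = FINDING_RE.match(raw)
        if m:
            findings.append(
                Finding(int(m.group(1)), m.group(2), m.group(3), m.group(4))
            )
    return findings  # "nothing is wrong" -> no matching lines -> []
```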
Track the hit rate. I log every AI review comment and whether the human reviewer accepted, dismissed, or ignored it. Our current acceptance rate is about 22%. That means 78% of AI review output is noise. Not great. But the 22% that lands includes some of the highest-severity findings in our review history.
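The bookkeeping is trivial once the outcomes are logged. A minimal sketch, assuming each AI comment gets recorded with one of three outcomes:

```python
from collections import Counter

def acceptance_rate(log: list[dict]) -> float:
    """log entries look like {"outcome": "accepted" | "dismissed" | "ignored"}.

    Returns the fraction of AI review comments the human reviewer accepted.
    """
    outcomes = Counter(entry["outcome"] for entry in log)
    total = sum(outcomes.values())
    return outcomes["accepted"] / total if total else 0.0
```

Run it over a month of comments and you know whether the tool is earning its place, rather than guessing.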
Never gate merges on it. AI review is advisory. A comment. A suggestion. The human reviewer decides. The moment you make AI review a merge blocker, you’ve handed authority to a system that’s wrong four times out of five. Don’t do this.
The uncomfortable math
AI code review costs money. Token costs, API calls, latency in your CI pipeline. At our current volume, it adds about 15-30 seconds per PR and a few dollars per day. That’s cheap for the bugs it catches. But if you aren’t measuring hit rate, you have no idea whether it’s worth it.
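The back-of-envelope math is simple enough to script. A sketch with made-up numbers; plug in your own PR volume and your provider's actual per-million-token prices.

```python
def daily_review_cost(prs_per_day: int,
                      avg_diff_tokens: int,
                      avg_output_tokens: int,
                      input_price_per_mtok: float,
                      output_price_per_mtok: float) -> float:
    """Estimated dollars per day for AI review at a given PR volume."""
    per_pr = (avg_diff_tokens * input_price_per_mtok
              + avg_output_tokens * output_price_per_mtok) / 1_000_000
    return prs_per_day * per_pr

# e.g. 40 PRs/day, ~20k tokens of diff in, ~1k tokens of findings out,
# at assumed prices of $3 / $15 per million tokens -> $3.00/day
```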
Most teams set up AI review, get excited about the first few catches, and then never look at the numbers again. Six months later, developers have learned to ignore the comments entirely because most of them are noise. The tool becomes furniture.
What I actually want
I want AI code review that knows when to shut up. That understands the system well enough to distinguish a real bug from an intentional design choice. That can read a PR description and connect the changes to the stated intent.
We aren’t there yet. But the foundation is real. Scope it tight, demand specifics, measure ruthlessly, and never trust it to make decisions. It’s a second pair of eyes, not a senior engineer.