Stop Fine-Tuning Models You Haven't Bothered to Prompt Properly

| 4 min read |
fine-tuning llm ai rant

Fine-tuning is the go-to move for teams who skipped the basics. Most of the time, better prompts and proper retrieval solve the actual problem.

I need to get something off my chest. I’ve reviewed six AI projects in the last two months where teams jumped straight to fine-tuning. Six. Not one of them had tried proper few-shot prompting first. Not one had a retrieval layer for domain knowledge. They saw “the model doesn’t know our stuff” and immediately reached for the most expensive, most maintenance-heavy tool in the shed.

This drives me nuts.

Fine-Tuning Isn’t Knowledge Injection

Let me say this clearly: fine-tuning changes behavior, not knowledge. If your problem is “the model doesn’t know about our product,” the answer is retrieval. RAG. Grounding. Whatever you want to call it, feed the model your docs at inference time.

Fine-tuning bakes patterns into weights. It’s good for consistent tone, strict output formats, and narrow tasks repeated at massive scale. It’s terrible for facts that change, knowledge that needs updating, or anything where you want to point at a source and say “the answer came from here.”

I’ve watched teams spend weeks curating training data to teach a model their product catalog. Then the catalog changes. Now the model confidently recommends products that no longer exist. Retrieval would have solved this in an afternoon.

The Decision Is Simple

Before you fine-tune anything, answer these questions honestly:

Have you pushed the prompt hard? Not a one-liner. A real system prompt with role definition, constraints, examples, and output format. Most teams write a lazy prompt, get mediocre results, and conclude the model needs training. No. Their prompt needs training.
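To make "a real system prompt" concrete, here's a minimal sketch. The product name, constraints, and example Q&A are all hypothetical placeholders — the point is the structure: role, constraints, output format, and at least one worked example.

```python
# Sketch of a structured system prompt: role, constraints, output
# format, and a worked example. All specifics (product name, JSON
# fields) are hypothetical placeholders.
SYSTEM_PROMPT = """You are a support assistant for AcmeCRM.

Constraints:
- Answer only from the provided context; say "I don't know" otherwise.
- Keep answers under 120 words.

Output format (JSON):
{"answer": "<text>", "confidence": "high|medium|low"}

Example:
Q: How do I export contacts?
A: {"answer": "Settings > Data > Export, then choose CSV.", "confidence": "high"}
"""

def build_messages(question: str) -> list[dict]:
    """Assemble a chat-style message list from the system prompt."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]
```

If results are still mediocre after something this explicit, then you've learned something real about the gap.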

Have you added retrieval? If the issue is domain knowledge, factual accuracy, or up-to-date information, retrieval is the answer. Fine-tuning can’t compete with a well-indexed knowledge base for factual tasks.
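As a sketch of what "feed the model your docs at inference time" means mechanically — a real system would use an embedding model and a vector index; the naive keyword overlap and two toy documents here are stand-ins:

```python
# Minimal retrieval-then-prompt sketch. A real system would use a
# vector index; plain keyword overlap stands in here, and the docs
# are hypothetical.
DOCS = {
    "refunds": "Refunds are issued within 14 days of purchase.",
    "shipping": "Standard shipping takes 3-5 business days.",
}

def retrieve(query: str, k: int = 1) -> list[str]:
    """Rank docs by word overlap with the query; return the top k."""
    q = set(query.lower().split())
    scored = sorted(DOCS.values(),
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def grounded_prompt(query: str) -> str:
    """Ground the model in retrieved docs at inference time."""
    context = "\n".join(retrieve(query))
    return (f"Context:\n{context}\n\n"
            f"Question: {query}\nAnswer from the context only.")
```

When the catalog changes, you update the index — not the weights.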

Is the remaining gap about behavior? After good prompts and solid retrieval, if the model still can’t hold a consistent tone, reliably produce a specific output structure, or stop drifting on a narrow repeated task, now we can talk about fine-tuning.

Is the volume worth it? Fine-tuning has upfront cost and ongoing maintenance. If the task runs ten times a day, just use a better prompt. If it runs ten thousand times a day and prompt tokens are eating your budget, fine-tuning starts to make economic sense.
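The economics can be sanity-checked with back-of-envelope arithmetic. Every number below is a made-up assumption — swap in your own prices, volumes, and tuning costs:

```python
# Back-of-envelope break-even sketch. Every figure here is a
# hypothetical assumption -- plug in your own prices and volumes.
def monthly_prompt_cost(calls_per_day: int,
                        prompt_tokens: int,
                        price_per_1k: float) -> float:
    """Monthly spend on prompt tokens alone (30-day month)."""
    return calls_per_day * 30 * prompt_tokens / 1000 * price_per_1k

# Long few-shot prompt vs. a short prompt on a tuned model:
base = monthly_prompt_cost(10_000, 2_000, 0.01)    # ~$6,000/month
tuned = monthly_prompt_cost(10_000, 200, 0.012)    # ~$720/month
savings = base - tuned

tuning_cost = 5_000  # one-off training + data curation (assumed)
months_to_break_even = tuning_cost / savings
```

At ten calls a day the same arithmetic never breaks even, which is the whole point.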

The Maintenance Tax Nobody Mentions

Here’s what the fine-tuning tutorials leave out. A tuned model is a versioned product. Your training data reflects a snapshot of your business at a moment in time. Products change. Policies change. Customer expectations change. Your training set drifts.

That means you need:

  • Versioned training sets in source control
  • A holdout evaluation set that you run against every new version
  • Monitoring for quality regression in production
  • A refresh cadence that’s actually budgeted and scheduled
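The evaluation and monitoring items above can be sketched as a tiny regression gate run against every new model version. The holdout examples, scoring function, and regression margin are all hypothetical — a real harness would use your task's actual metric:

```python
# Sketch of a holdout regression gate for a tuned model. The
# holdout data, exact-match scoring, and margin are hypothetical.
HOLDOUT = [
    {"input": "classify: refund request", "expected": "billing"},
    {"input": "classify: login broken", "expected": "auth"},
]

def score(model_fn, holdout) -> float:
    """Fraction of holdout examples the model gets exactly right."""
    hits = sum(model_fn(ex["input"]) == ex["expected"] for ex in holdout)
    return hits / len(holdout)

def gate(new_model_fn, baseline_score: float, holdout,
         margin: float = 0.02) -> bool:
    """Block the release if quality regresses beyond the margin."""
    return score(new_model_fn, holdout) >= baseline_score - margin
```

The point isn't the sophistication — it's that the check runs on every version, automatically, so degradation shows up in a failed gate instead of three months of silence.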

I’ve seen exactly one team do all of this well. Everyone else fine-tuned once, celebrated, and then watched quality slowly degrade over three months while nobody noticed because nobody was measuring.

When I Actually Recommend It

I’m not anti-fine-tuning. I’m anti-premature-fine-tuning. The legitimate cases exist:

  • You need a specific voice or brand tone that holds across thousands of outputs and few-shot examples aren’t stable enough
  • You have a narrow classification or extraction task at high volume where shaving prompt tokens saves real money
  • You need a strict output schema and the base model keeps introducing creative variations despite explicit instructions
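For the last case, it's worth measuring how often the base model actually breaks the schema before reaching for tuning. A sketch, assuming a JSON output with two required keys (the keys themselves are hypothetical):

```python
# Sketch: measure the schema-violation rate before deciding to tune.
# The required keys and sample outputs are hypothetical.
import json

REQUIRED_KEYS = {"answer", "confidence"}

def violates_schema(raw: str) -> bool:
    """True if the output isn't JSON with exactly the required keys."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return True
    return not isinstance(obj, dict) or set(obj) != REQUIRED_KEYS

def violation_rate(outputs: list[str]) -> float:
    """Fraction of model outputs that break the schema."""
    return sum(map(violates_schema, outputs)) / len(outputs)
```

A 2% violation rate is a retry loop; a 30% rate is a real argument for tuning.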

From what I’ve seen, maybe one in five projects that ask about fine-tuning actually need it. The rest need better prompts, proper retrieval, or both.

The Honest Checklist

  1. Write a real system prompt with examples and constraints. Test it on 50 representative inputs.
  2. If factual accuracy is the gap, add retrieval. Test again.
  3. If behavior consistency is still the gap at high volume, collect 200+ high-quality examples that match real production inputs.
  4. Hold out 20% for evaluation. Fine-tune. Compare against the base model on both your target metric and general reasoning.
  5. If the tuned model wins on behavior but loses on reasoning, reconsider whether the tradeoff is worth it.
  6. Version everything. Monitor everything. Schedule refreshes.
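Steps 3–4 imply a split that stays stable as you collect more examples. One way to sketch that is hashing each input, so an example never migrates between train and holdout across runs (the 20% figure and the `input` field name are assumptions):

```python
# Sketch of a deterministic 80/20 split for steps 3-4: hash each
# example's input so holdout membership is stable across runs.
import hashlib

def split(examples: list[dict], holdout_pct: int = 20):
    """Partition examples into (train, holdout) by input hash."""
    train, holdout = [], []
    for ex in examples:
        digest = hashlib.sha256(ex["input"].encode()).digest()
        bucket = int.from_bytes(digest[:2], "big") % 100
        (holdout if bucket < holdout_pct else train).append(ex)
    return train, holdout
```

A random split re-shuffled on every run leaks holdout examples into training the moment your dataset grows; a hash-based split doesn't.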

Stop treating fine-tuning as step one. It’s step last.