When to Fine-Tune (And When Not To) — Fine-Tuning LLMs: From Data to Deployment

The Three Approaches to Customizing LLMs

When you need an LLM to behave differently from its default behavior, you have three fundamental options: prompt engineering, retrieval-augmented generation (RAG), and fine-tuning. Each has distinct strengths, and choosing the wrong one wastes time, money, and effort. This lesson gives you a clear decision framework so you never commit to fine-tuning when a well-crafted system prompt would have solved the problem — and vice versa.

Prompt Engineering

Prompt engineering is the lightest-touch approach. You write detailed system prompts, provide few-shot examples, and structure your instructions to get the model to behave the way you want. It requires no training, no GPUs, and no dataset.

Best for: General-purpose tasks, rapid prototyping, tasks where requirements change frequently, small teams without ML infrastructure.

Limitations: The context window is finite, and complex behavior requires long prompts that eat into your token budget. The model may not consistently follow intricate formatting rules. You pay for prompt tokens on every single request.

Retrieval-Augmented Generation (RAG)

RAG augments the model's knowledge by retrieving relevant documents at query time and including them in the context. The model itself remains unchanged — you are feeding it external knowledge dynamically.
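The retrieve-then-prompt flow can be sketched in a few lines. This is a minimal illustration, not a production design: it assumes a toy bag-of-words cosine similarity in place of a real embedding model and vector index, and the documents, function names, and prompt wording are all hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a toy stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    query_vec = bag_of_words(query)
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_vec, bag_of_words(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_rag_prompt(query: str, documents: list[str], k: int = 2) -> str:
    """Place the retrieved passages in the context; the model is unchanged."""
    sources = "\n".join(
        f"[{i + 1}] {doc}" for i, doc in enumerate(retrieve(query, documents, k))
    )
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {query}"
    )


# Hypothetical knowledge base for illustration only.
DOCS = [
    "Refunds are issued within 14 days of a return request.",
    "Our office is closed on public holidays.",
    "A password reset requires access to the registered email address.",
]

print(build_rag_prompt("How do I reset my password?", DOCS, k=1))
```

Note that everything the model learns about the password policy arrives through the prompt. If `retrieve` ranks the wrong passage first, the answer degrades even though the model is identical, which is the retrieval bottleneck described in the limitations below.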

Best for: Knowledge-intensive tasks (customer support over documentation, legal research, internal knowledge bases), scenarios where the information changes frequently, cases where you need citations and source attribution.

Limitations: Retrieval quality is a bottleneck — if the right documents are not retrieved, the answer suffers. Does not change the model's style, tone, or reasoning patterns. Adds latency from the retrieval step. Complex to maintain at scale (chunking strategy, embedding model selection, index updates).

Fine-Tuning

Fine-tuning modifies the model's weights so it inherently behaves differently. The behavior is baked into the model rather than injected through the prompt or context.

Best for: Consistent style or tone requirements, domain-specific vocabulary and jargon, specific output formats (structured JSON, medical reports, legal clauses), latency-sensitive applications where long prompts are too slow, cost reduction when you are currently using long system prompts on every request.

Limitations: Requires a quality dataset (typically 200+ examples minimum). Needs GPU compute for training. The model can overfit or lose general capabilities. Updates require retraining.

Signs You Need Fine-Tuning

Not every project benefits from fine-tuning. Here are the concrete signals that tell you it is time:

Consistent style or voice. Your application needs to always respond in a specific tone — a medical assistant that uses clinical language, a customer-facing bot that matches your brand voice, a legal tool that writes in formal contract language. Prompt engineering can approximate this, but fine-tuning makes it reliable.
Domain-specific jargon. Your field has specialized terminology that the base model handles awkwardly. Financial models, biomedical text, manufacturing processes — fine-tuning teaches the model to use these terms naturally rather than treating them as unusual vocabulary.
Specific output format. You need structured outputs that follow exact schemas — particular JSON structures, XML templates, tabular formats, or report layouts. Fine-tuning makes format compliance near-automatic rather than requiring elaborate prompt instructions.
Latency requirements. If your current solution uses a 2,000-token system prompt to get acceptable behavior, that prompt adds latency and cost to every request. Fine-tuning those instructions into the model eliminates that overhead.
Cost at scale. When you are making thousands of API calls per day with long prompts, fine-tuning a smaller model to match the quality of a larger model with elaborate prompting can dramatically reduce costs.
Task-specific reasoning. The model needs to follow a particular reasoning pattern — a specific chain of thought for medical diagnosis, a particular framework for code review, a defined methodology for risk assessment.

Signs You Should NOT Fine-Tune

Equally important is knowing when to avoid fine-tuning:

Small or low-quality dataset. If you have fewer than 100 high-quality examples, well below the 200+ working minimum, fine-tuning is likely to overfit or produce negligible improvement. Start with prompt engineering and collect more data over time.
Frequently changing requirements. If the behavior you need changes every week, retraining constantly is impractical. Use prompts or RAG instead.
General knowledge tasks. If you need the model to know about current events, recent documentation, or a large corpus of information, RAG is the right tool. Fine-tuning does not reliably inject factual knowledge — it changes behavior, not knowledge.
No clear evaluation criteria. If you cannot define what "good output" looks like for your task, you cannot build a dataset, and you cannot measure whether fine-tuning helped. Define your success metrics first.
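Budget constraints with no GPU access. While QLoRA has made fine-tuning accessible on consumer hardware, you still need a capable GPU; as a rule of thumb, plan for about 16GB of GPU memory (or a cloud equivalent), with the exact requirement depending on model size and sequence length. If that is not available, focus on prompt engineering.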

Decision Flowchart

Use this sequence of questions to determine your approach:

1. Can few-shot prompting solve the task adequately?
   YES -> Use prompt engineering. Stop here.
   NO  -> Continue.

2. Is the main problem a lack of knowledge/information?
   YES -> Implement RAG. Stop here.
   NO  -> Continue.

3. Is the main problem behavior, style, format, or reasoning?
   YES -> Continue to question 4.
   NO  -> Revisit your problem definition.

4. Do you have 200+ high-quality examples of desired behavior?
   YES -> Fine-tune. Proceed to the rest of this course.
   NO  -> Collect more data. Use prompt engineering in the meantime.
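The same four questions can be written as a small function, which is handy if you want to record the decision alongside a project brief. This is only a sketch of the flowchart above: the `ProjectAssessment` fields and the returned labels are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

# Threshold taken from question 4 of the flowchart.
MIN_QUALITY_EXAMPLES = 200


@dataclass
class ProjectAssessment:
    """Answers to the flowchart questions (hypothetical field names)."""
    few_shot_is_adequate: bool       # question 1
    main_gap_is_knowledge: bool      # question 2
    main_gap_is_behavior: bool       # question 3: behavior, style, format, reasoning
    num_quality_examples: int        # question 4


def recommend_approach(project: ProjectAssessment) -> str:
    """Walk the questions in order and stop at the first decisive answer."""
    if project.few_shot_is_adequate:
        return "prompt engineering"
    if project.main_gap_is_knowledge:
        return "RAG"
    if not project.main_gap_is_behavior:
        return "revisit problem definition"
    if project.num_quality_examples >= MIN_QUALITY_EXAMPLES:
        return "fine-tune"
    return "collect more data; use prompt engineering meanwhile"


# Example: a formatting problem with only 80 curated examples so far.
print(recommend_approach(ProjectAssessment(False, False, True, 80)))
```

The order of the checks matters: a project with plenty of data still lands on prompt engineering or RAG if an earlier question already resolves it.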

Cost/Benefit Analysis

Here is a qualitative comparison for a production application handling 10,000 requests per day:

| Approach | Upfront Cost | Per-Request Cost | Maintenance | Quality Consistency |
|----------|-------------|-----------------|-------------|-------------------|
| Prompt engineering (GPT-4o) | $0 | High (long prompts) | Low | Medium |
| RAG + smaller model | Medium (infra) | Medium | High (index updates) | Medium-High |
| Fine-tuned smaller model | Medium (training) | Low (short prompts) | Medium (retrain) | High |

The sweet spot for fine-tuning is when you have stable requirements, a well-defined task, and enough data to train on. At sufficient request volume, the ongoing savings from using shorter prompts with a fine-tuned model can pay for the training investment within weeks.
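A back-of-the-envelope calculation makes the payback reasoning concrete. Every number below is an illustrative assumption rather than a quote from any provider: the per-token price, the prompt lengths, and the training cost are placeholders to replace with your own figures. The sketch also counts only input prompt tokens and ignores output tokens, hosting, and retraining.

```python
from typing import Optional


def daily_prompt_cost(
    requests_per_day: int, prompt_tokens: int, price_per_million_tokens: float
) -> float:
    """Daily spend on prompt (input) tokens alone, in the price's currency."""
    return requests_per_day * prompt_tokens * price_per_million_tokens / 1_000_000


def break_even_days(
    training_cost: float, daily_cost_before: float, daily_cost_after: float
) -> Optional[float]:
    """Days until prompt savings cover the training cost; None if no savings."""
    daily_savings = daily_cost_before - daily_cost_after
    if daily_savings <= 0:
        return None
    return training_cost / daily_savings


# All values below are hypothetical.
REQUESTS_PER_DAY = 10_000          # volume used in the comparison above
PRICE_PER_MILLION = 2.50           # assumed USD per million input tokens
LONG_PROMPT_TOKENS = 2_000         # elaborate system prompt
SHORT_PROMPT_TOKENS = 200          # prompt after fine-tuning
TRAINING_COST = 500.0              # assumed one-off training spend in USD

before = daily_prompt_cost(REQUESTS_PER_DAY, LONG_PROMPT_TOKENS, PRICE_PER_MILLION)
after = daily_prompt_cost(REQUESTS_PER_DAY, SHORT_PROMPT_TOKENS, PRICE_PER_MILLION)
days = break_even_days(TRAINING_COST, before, after)

print(f"Before: ${before:.2f}/day, after: ${after:.2f}/day")  # $50.00 and $5.00
print(f"Break-even after about {days:.1f} days")              # about 11.1 days
```

Under these assumed numbers the training cost is recovered in under two weeks. With a tenth of the traffic the same calculation gives roughly 111 days, which is why request volume belongs in the decision.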

Combining Approaches

These three approaches are not mutually exclusive. In fact, the most powerful production systems combine them:

Fine-tuning + RAG: Fine-tune for style and format, use RAG for knowledge. Example: a medical assistant fine-tuned on clinical writing style that retrieves from the latest medical literature.
Fine-tuning + prompt engineering: Fine-tune for base behavior, use prompts for per-request customization. Example: a customer service model fine-tuned on your company's tone that receives customer context through the prompt.
All three: Fine-tuned model with RAG for knowledge and careful system prompts for edge-case handling.

Practical Tip

Before committing to fine-tuning, run this experiment: Take your 20 best examples of desired behavior and use them as few-shot examples in a prompt. Test against 50 held-out examples. If the few-shot approach achieves 90%+ of your quality target, prompt engineering may be sufficient. If it falls short, you have strong evidence that fine-tuning will add value — and you have already started building your training dataset.
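The experiment above can be scripted so that it is repeatable. The sketch below is model-agnostic by design: `generate` stands for whatever function calls your model and `is_acceptable` for whatever check defines acceptable output in your task. Both are hypothetical callables you supply, as is the `Input:`/`Output:` prompt layout.

```python
from typing import Callable, Sequence

Example = tuple[str, str]  # (input text, desired output)


def build_few_shot_prompt(shots: Sequence[Example], new_input: str) -> str:
    """Format the best examples as demonstrations, then append the new input."""
    parts = [f"Input: {x}\nOutput: {y}" for x, y in shots]
    parts.append(f"Input: {new_input}\nOutput:")
    return "\n\n".join(parts)


def few_shot_pass_rate(
    shots: Sequence[Example],
    held_out: Sequence[Example],
    generate: Callable[[str], str],
    is_acceptable: Callable[[str, str], bool],
) -> float:
    """Fraction of held-out examples whose generated output is acceptable."""
    if not held_out:
        raise ValueError("held_out must contain at least one example")
    passed = sum(
        is_acceptable(generate(build_few_shot_prompt(shots, x)), reference)
        for x, reference in held_out
    )
    return passed / len(held_out)


def prompt_engineering_suffices(
    pass_rate: float, quality_target: float, fraction: float = 0.9
) -> bool:
    """True if the few-shot result reaches 90% (by default) of the target."""
    return pass_rate >= fraction * quality_target
```

In practice you would pass your 20 best examples as `shots` and the 50 held-out examples as `held_out`. Keeping the two sets disjoint matters: an example that appears in the prompt and in the test set inflates the pass rate.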

In the next lesson, we will survey the landscape of fine-tuning techniques so you can choose the right method for your specific constraints.