The Fine-Tuning Landscape — Fine-Tuning LLMs: From Data to Deployment

From Full Fine-Tuning to PEFT

The fine-tuning landscape has evolved dramatically. In 2022, fine-tuning a model meant updating every single weight, which required massive GPU clusters and enterprise budgets. Today, parameter-efficient methods let you fine-tune a 7B model on a single consumer GPU and a 70B model on a single data center GPU. Understanding the full spectrum of techniques helps you choose the right method for your compute budget, quality requirements, and deployment constraints.

Full Fine-Tuning

Full fine-tuning updates every parameter in the model. For a 7B parameter model, that means modifying all 7 billion weights during training. This is the gold standard in terms of quality — the model has maximum freedom to adapt to your data.

Requirements: For a 7B model in float16, you need roughly 14GB just for model weights. Gradients add about the same again, and Adam-style optimizer states add several times more, so the total is a multiple of the weight footprint before activations are counted. Realistically, you need 80GB+ VRAM (an A100 80GB or multiple GPUs). For 70B models, you are looking at a multi-GPU setup with 4-8 A100s or more.
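The arithmetic behind that estimate can be sketched as follows. This is a rough back-of-the-envelope calculation, not a profiler: the byte counts are assumptions (fp16 weights and gradients, Adam keeping two fp32 moments per parameter), and activations, fp32 master weights, and framework overhead are ignored. The function name is illustrative.

```python
def full_finetune_memory_gb(params_billion: float,
                            weight_bytes: int = 2,
                            grad_bytes: int = 2,
                            optimizer_bytes: int = 8) -> dict:
    """Rough per-component memory for full fine-tuning, in GB (1 GB = 1e9 bytes).

    Assumed defaults: fp16 weights and gradients (2 bytes each) and Adam
    with two fp32 moments per parameter (8 bytes). Activations are ignored.
    """
    weights = params_billion * weight_bytes
    gradients = params_billion * grad_bytes
    optimizer = params_billion * optimizer_bytes
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "total": weights + gradients + optimizer,
    }


estimate = full_finetune_memory_gb(7)
for component, gigabytes in estimate.items():
    print(f"{component:>10}: {gigabytes} GB")
```

Under these assumptions a 7B model lands at about 84 GB before activations, which is in line with the 80GB+ figure. Different optimizers or precisions shift the multiple, so treat the output as an order-of-magnitude guide.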

When to use: When you have abundant compute, a large dataset (10,000+ examples), and you need maximum adaptation. Common in enterprise settings or when creating a specialized foundation model.

Drawbacks: Expensive, slow, risk of catastrophic forgetting (the model loses general capabilities), and the result is a completely new model that you must store and serve at full size.

Parameter-Efficient Fine-Tuning (PEFT)

PEFT methods freeze most of the model's weights and only train a small number of additional or modified parameters. The key insight is that the changes needed for task adaptation often lie in a low-dimensional subspace — you do not need to move all 7 billion parameters to teach the model a new behavior.

LoRA (Low-Rank Adaptation)

LoRA is the most popular PEFT method and the one you will use most in practice. Instead of modifying a weight matrix W directly, LoRA decomposes the update into two smaller matrices: a down-projection A and an up-projection B, where the rank r is much smaller than the original dimensions.

# Conceptual: instead of updating W (d x d), LoRA learns:
# W' = W + (B @ A) where B is (d x r) and A is (r x d)
# For r=16 and d=4096: 4096*4096 = 16.7M params vs 2*4096*16 = 131K params

This means that for a rank of 16 on a 4096-dimensional layer, you train 131K parameters instead of 16.7M — a 99.2% reduction. The trained LoRA weights are typically 10-50MB, compared to the 14GB base model.
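The numbers above can be reproduced with a few lines of arithmetic. The sketch below also estimates adapter size for a hypothetical model shape (32 layers, four adapted 4096 x 4096 projections per layer, fp16 storage); those shape values are assumptions chosen for illustration, not a description of any specific checkpoint.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters for one LoRA-adapted matrix.

    A has shape (r x d_in) and B has shape (d_out x r).
    """
    return r * d_in + d_out * r


d, r = 4096, 16
full = d * d                      # parameters in the original d x d matrix
lora = lora_params(d, d, r)       # parameters in A and B combined
reduction = 1 - lora / full
print(full, lora, f"{reduction:.1%}")  # 16777216 131072 99.2%

# Assumed shape for illustration: 32 layers, 4 adapted projections per layer.
n_layers, n_targets = 32, 4
adapter_params = n_layers * n_targets * lora
adapter_mb = adapter_params * 2 / 1e6  # 2 bytes per parameter in fp16
print(f"{adapter_params / 1e6:.1f}M params, ~{adapter_mb:.0f} MB")
```

With these assumed shapes the adapter comes to roughly 17M parameters and about 34 MB, which sits inside the 10-50MB range quoted above.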

Typical configuration: Rank 16-64, applied to attention layers (q_proj, v_proj, and often k_proj, o_proj). Higher ranks give more capacity but increase memory and training time.

QLoRA (Quantized LoRA)

QLoRA combines LoRA with 4-bit quantization of the base model. The base model is loaded in 4-bit precision (NF4 data type), and LoRA adapters are trained in float16/bfloat16 on top. This reduces the memory needed for the base model by roughly 4x.
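A minimal sketch of that setup with the Hugging Face `transformers`, `bitsandbytes`, and `peft` libraries might look like the following. The model ID and the hyperparameters are placeholders chosen for illustration, and the snippet assumes a CUDA GPU with `bitsandbytes` installed.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Placeholder model ID (assumption); substitute the base model you are using.
model_id = "meta-llama/Meta-Llama-3-8B"

# Load the frozen base model in 4-bit NF4; matmuls are computed in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepares the quantized model for training (for example, casting norms to fp32).
model = prepare_model_for_kbit_training(model)

# LoRA adapters are trained in higher precision on top of the 4-bit base.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

Only the adapter weights receive gradients here; the 4-bit base stays frozen, which is where the memory saving comes from.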

Memory comparison for a 7B model:

| Method | Base Model Memory | Trainable Params | Total Training VRAM |
|--------|------------------|-----------------|-------------------|
| Full fine-tuning (fp16) | ~14 GB | 7B | ~80 GB |
| LoRA (fp16) | ~14 GB | ~20M | ~18 GB |
| QLoRA (4-bit) | ~3.5 GB | ~20M | ~6 GB |

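QLoRA makes it possible to fine-tune a 7B model on a single RTX 3090 (24GB) or even a T4 (16GB) with careful batch size management. For 70B models, QLoRA brings the requirement down to a single data center GPU: the 4-bit weights alone take about 35GB, so an A100 40GB is a tight fit and an 80GB card leaves more headroom.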
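The base-model figures in this section follow from bits per parameter. The helper below is a simple sketch of that arithmetic; it ignores quantization metadata, adapters, activations, and optimizer state, so real usage will be somewhat higher. The function name is illustrative.

```python
def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8


for label, size in [("7B", 7), ("70B", 70)]:
    fp16 = weight_memory_gb(size, 16)
    nf4 = weight_memory_gb(size, 4)
    print(f"{label}: fp16 ~{fp16} GB, 4-bit ~{nf4} GB")
```

The 4x gap between the two columns is exactly the ratio of 16 bits to 4 bits per weight.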

DoRA (Weight-Decomposed Low-Rank Adaptation)

DoRA is a newer technique (2024) that decomposes each pre-trained weight matrix into magnitude and direction components and applies the low-rank update to the direction. Its authors report that it outperforms LoRA at the same rank, at the cost of a modest training overhead. The intuition is that separating "how much" to change from "which direction" to change gives the optimizer more flexibility.
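In the notation used for LoRA earlier, the decomposition can be written as below. Here m is a learnable magnitude vector (one entry per column, initialized from the column norms of W), and the subscript c denotes a column-wise norm; these symbol names are introduced for illustration.

```latex
W' = m \cdot \frac{W + BA}{\lVert W + BA \rVert_{c}},
\qquad m_{\text{init}} = \lVert W \rVert_{c}
```

W stays frozen, while m, A, and B are trained. The fraction carries the direction and m carries the magnitude.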

from peft import LoraConfig

# DoRA configuration — uses the same interface as LoRA
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    use_dora=True,  # Enable DoRA
    lora_dropout=0.05,
)

When to use: When you want better quality than LoRA without significantly increasing compute. DoRA is an increasingly common recommendation for new fine-tuning projects.

Other PEFT Methods

Adapters: Small bottleneck layers inserted between transformer blocks. Older approach, largely superseded by LoRA.
Prefix tuning: Learns virtual tokens prepended to each layer's input. Useful for multi-task scenarios but generally lower quality than LoRA.
Prompt tuning: Learns soft prompt embeddings that are prepended to the input. Lightest-weight method but limited adaptation capacity.
IA3: Learns scaling vectors for keys, values, and feed-forward layers. Even fewer parameters than LoRA but more limited.

Which Models Can Be Fine-Tuned

Not all models are equal in the fine-tuning ecosystem. Here are the most commonly fine-tuned model families:

Open-Weight Models (Full Access)

Llama 3 / 3.1 / 3.2 (Meta): The most popular base for fine-tuning. Available in 1B, 3B, 8B, and 70B sizes (3.1 also includes a 405B model). Excellent base quality, strong community support, and a community license that permits most commercial use with some restrictions.
Mistral / Mixtral (Mistral AI): Mistral 7B and Mixtral 8x7B are efficient and fine-tune well. Mistral's architecture innovations (sliding window attention) translate well to fine-tuned variants.
Qwen 2 / 2.5 (Alibaba): Strong multilingual capabilities. Excellent choice for non-English fine-tuning tasks. Available up to 72B.
Gemma 2 (Google): Clean architecture, strong performance for size. Available in 2B, 9B, and 27B. Good licensing for commercial use.
Phi-3 / Phi-4 (Microsoft): Small but capable. Phi-3 Mini (3.8B) and Phi-4 (14B) punch above their weight. Good choice when you need a small deployed model.

API-Only Fine-Tuning

GPT-4o / GPT-4o-mini (OpenAI): Fine-tuning through the API. Limited control — you submit data, OpenAI trains and hosts the model. No access to weights.
Claude (Anthropic): Fine-tuning available for enterprise customers. Similar API-based approach.

Choosing a Base Model

Your choice of base model depends on several factors:

Deployment constraint    -> Model recommendation
-------------------------------------------------
Edge / mobile            -> Phi-3 Mini (3.8B), Llama 3.2 (1B/3B)
Single consumer GPU      -> Llama 3 8B, Mistral 7B, Gemma 2 9B
Single data center GPU   -> Llama 3 70B, Qwen 2.5 72B (with QLoRA)
Multilingual required    -> Qwen 2.5, Llama 3.1
Maximum quality          -> Llama 3.1 70B, Qwen 2.5 72B

The Evolution of Accessibility

To appreciate where we are, consider the timeline:

2020-2022: Fine-tuning required multi-GPU clusters. Only large companies and research labs could do it meaningfully.
2023: LoRA and QLoRA democratized fine-tuning. A single A100 could handle 70B models. Consumer GPUs could handle 7B models.
2024-2025: Tools like Unsloth, Axolotl, and LLaMA-Factory made fine-tuning accessible through simple configuration files. DoRA improved quality. Flash Attention reduced memory further.
2026: Fine-tuning a 7B model with QLoRA on a small dataset can take around 30 minutes on a free Colab T4. The barrier is now data quality, not compute.

Practical Decision Matrix

Use this to choose your method:

| Your Situation | Recommended Approach |
|----------------|---------------------|
| 16GB VRAM, 7B model | QLoRA, rank 16-32 |
| 24GB VRAM, 7B model | LoRA or QLoRA, rank 32-64 |
| 40-80GB VRAM, 7B model | LoRA (fp16), rank 64+ |
| 24GB VRAM, 70B model | Not feasible for training |
| 40GB VRAM, 70B model | QLoRA, rank 16 |
| 80GB+ VRAM, 70B model | QLoRA, rank 32+ |
| Multiple A100s, 70B model | Full fine-tuning or LoRA fp16 |

For most practitioners, QLoRA with rank 16-32 on an 8B model is the starting point. It is fast, affordable, and produces excellent results for most use cases.
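The single-GPU rows of the matrix can be encoded as a small lookup. This is a simplified sketch: the function name and the "7B" / "70B" class labels are illustrative, the multi-GPU row is left out, and where the matrix lists alternatives for 70B the sketch returns the QLoRA option.

```python
def recommend_method(vram_gb: float, model_class: str) -> str:
    """Map single-GPU VRAM and model class ("7B" or "70B") to an approach."""
    if model_class == "7B":
        if vram_gb >= 40:
            return "LoRA (fp16), rank 64+"
        if vram_gb >= 24:
            return "LoRA or QLoRA, rank 32-64"
        if vram_gb >= 16:
            return "QLoRA, rank 16-32"
        return "Below the smallest tier in the matrix"
    if model_class == "70B":
        if vram_gb >= 80:
            return "QLoRA, rank 32+"
        if vram_gb >= 40:
            return "QLoRA, rank 16"
        return "Not feasible for training"
    raise ValueError(f"Unsupported model class: {model_class!r}")


print(recommend_method(16, "7B"))
print(recommend_method(24, "70B"))
```

Treat the thresholds as starting points; sequence length and batch size can push a borderline setup over or under its VRAM limit.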

In the next lesson, we dive into the most critical part of any fine-tuning project: your data.