Training with Unsloth — Fine-Tuning LLMs: From Data to Deployment

Why Unsloth Exists

The standard Hugging Face training stack works, but it leaves performance on the table. Unsloth rewrites key parts of the training pipeline with custom CUDA kernels and optimized memory management, achieving 2x faster training and up to 60% less memory usage — with zero change in model quality. If you are fine-tuning on consumer hardware or want to minimize cloud GPU costs, Unsloth is the tool you should learn.

What Makes Unsloth Fast

Unsloth's speed comes from several optimizations:

Custom Triton kernels for cross-entropy loss and LoRA layers, eliminating unnecessary memory allocations
Fused operations that combine multiple steps (RoPE embedding, layer normalization, LoRA forward pass) into single GPU calls
Intelligent memory management that reuses buffers instead of allocating new ones
Optimized backward pass that computes gradients more efficiently for LoRA parameters

These optimizations are applied automatically when you load a model through Unsloth's API. You do not need to change your training code beyond the model loading step.

Installation

pip install unsloth

For specific CUDA versions or if you encounter issues:

# For CUDA 12.1
pip install unsloth[cu121-ampere-torch250]

# For CUDA 11.8
pip install unsloth[cu118-torch250]

Unsloth also works in Google Colab and Kaggle notebooks with no special configuration.

Supported Models

Unsloth supports the most popular open-weight model families:

Llama 3 / 3.1 / 3.2 (1B, 3B, 8B, 70B)
Mistral / Mixtral (7B, 8x7B)
Qwen 2 / 2.5 (0.5B to 72B)
Gemma 2 (2B, 9B, 27B)
Phi-3 / Phi-4 (3.8B, 14B)

Unsloth provides pre-quantized 4-bit versions of these models that load faster than quantizing from scratch:

# Unsloth's pre-quantized models (recommended)
"unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"
"unsloth/Mistral-7B-Instruct-v0.3-bnb-4bit"
"unsloth/Qwen2.5-7B-Instruct-bnb-4bit"
"unsloth/gemma-2-9b-it-bnb-4bit"

FastLanguageModel: Loading and Configuration

Unsloth replaces the standard HF model loading with FastLanguageModel:

from unsloth import FastLanguageModel
import torch

# Load model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=2048,
    dtype=None,           # Auto-detect (bfloat16 on Ampere+, float16 otherwise)
    load_in_4bit=True,    # QLoRA
)

This single call handles everything: model loading, quantization, tokenizer setup, and memory optimization.

Applying LoRA

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.0,     # Unsloth recommends 0 for speed
    bias="none",
    use_gradient_checkpointing="unsloth",  # Unsloth's optimized version
    random_state=42,
)

Key difference: Unsloth recommends lora_dropout=0.0 because their optimized kernels are fastest without dropout. If you need regularization, use a smaller learning rate or early stopping instead.

The use_gradient_checkpointing="unsloth" flag enables Unsloth's custom gradient checkpointing that is 30% faster than the standard implementation.

Training Configuration

Unsloth uses the standard Hugging Face SFTTrainer, so the training code is nearly identical:

from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load dataset
dataset = load_dataset("json", data_files={
    "train": "data/train.jsonl",
    "validation": "data/val.jsonl",
})

training_args = TrainingArguments(
    output_dir="./output/llama3-unsloth",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    lr_scheduler_type="linear",
    warmup_steps=10,
    optim="adamw_8bit",
    weight_decay=0.01,
    bf16=True,
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=100,
    save_total_limit=3,
    seed=42,
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    max_seq_length=2048,
    packing=False,
    dataset_text_field=None,
)

# Train
trainer.train()

Complete Example: Fine-Tuning Llama 3 on Custom Data

Here is a complete, self-contained script:

"""
Complete Unsloth fine-tuning script for Llama 3.1 8B.
Requires: pip install unsloth trl datasets
Hardware: Works on 16GB VRAM (T4, RTX 4060, etc.)
"""
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
import torch

# === 1. Load Model ===
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)

# === 2. Apply LoRA ===
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
)

# === 3. Load Dataset ===
dataset = load_dataset("json", data_files={
    "train": "data/train.jsonl",
    "validation": "data/val.jsonl",
})

# === 4. Training ===
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=TrainingArguments(
        output_dir="./output/llama3-custom",
        num_train_epochs=3,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        lr_scheduler_type="linear",
        warmup_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        bf16=True,
        logging_steps=10,
        eval_strategy="steps",
        eval_steps=50,
        save_strategy="steps",
        save_steps=100,
        save_total_limit=3,
        seed=42,
    ),
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    max_seq_length=2048,
)

trainer.train()

# === 5. Save LoRA Adapter ===
model.save_pretrained("./output/llama3-custom/final")
tokenizer.save_pretrained("./output/llama3-custom/final")

# === 6. Test the Model ===
FastLanguageModel.for_inference(model)  # Enable fast inference mode

messages = [
    {"role": "user", "content": "Explain what fine-tuning is."}
]

inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to("cuda")

outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=256,
    temperature=0.7,
    do_sample=True,
)

print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))

Comparison: Unsloth vs Vanilla Hugging Face

Benchmarks on an RTX 3090 (24GB), Llama 3.1 8B, QLoRA r=16, batch size 2:

| Metric | Hugging Face | Unsloth | Improvement | |--------|-------------|---------|------------| | Training speed | 1.2 it/s | 2.5 it/s | 2.1x faster | | Peak VRAM | 14.2 GB | 8.1 GB | 43% less | | Total training time (1000 steps) | 14 min | 6.7 min | 2.1x faster | | Final loss | 0.847 | 0.845 | Equivalent |

The quality (measured by final loss and human evaluation) is identical — Unsloth does not cut corners on the math, it just computes the same math more efficiently.

When to Use Unsloth

Use Unsloth when:

You are training on consumer GPUs (16-24GB VRAM)
You want to minimize cloud GPU costs
You are doing rapid iteration and need fast turnaround
You are training on Colab/Kaggle free tier

Use vanilla Hugging Face when:

You need multi-GPU training (Unsloth's multi-GPU support is more limited)
You need custom training loops that modify the training step
You are using a model architecture Unsloth does not support yet
You need DeepSpeed or FSDP integration

Saving and Exporting

Unsloth makes exporting easy:

# Save as LoRA adapter (smallest, for later merging)
model.save_pretrained("output/lora-adapter")

# Save merged model in float16
model.save_pretrained_merged("output/merged-f16", tokenizer, save_method="merged_16bit")

# Save as GGUF for Ollama/llama.cpp
model.save_pretrained_gguf("output/gguf", tokenizer, quantization_method="q4_k_m")

# Push to Hugging Face Hub
model.push_to_hub_merged("your-username/your-model", tokenizer)
model.push_to_hub_gguf("your-username/your-model-gguf", tokenizer, quantization_method="q4_k_m")

This is one of Unsloth's biggest conveniences — converting to GGUF directly from the training script, without a separate conversion step.

In the next lesson, we cover how to monitor your training runs and debug the most common issues that arise during fine-tuning.