Training with Unsloth
Why Unsloth Exists
The standard Hugging Face training stack works, but it leaves performance on the table. Unsloth rewrites key parts of the training pipeline with custom Triton GPU kernels and optimized memory management. The project reports roughly 2x faster training and up to 60% less memory usage, with no change in model quality. If you are fine-tuning on consumer hardware or want to minimize cloud GPU costs, Unsloth is worth learning.
What Makes Unsloth Fast
Unsloth's speed comes from several optimizations:
- Custom Triton kernels for cross-entropy loss and LoRA layers, eliminating unnecessary memory allocations
- Fused operations that combine multiple steps (RoPE embedding, layer normalization, LoRA forward pass) into single GPU calls
- Intelligent memory management that reuses buffers instead of allocating new ones
- Optimized backward pass that computes gradients more efficiently for LoRA parameters
These optimizations are applied automatically when you load a model through Unsloth's API. You do not need to change your training code beyond the model loading step.
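To make the "fewer allocations" idea concrete, here is a minimal pure-Python sketch of the cross-entropy case. It is not Unsloth's Triton kernel, and the function names are hypothetical. It only shows the math such a kernel can exploit: the loss equals log-sum-exp of the logits minus the target logit, so the full softmax vector never needs to be stored.

```python
import math

def cross_entropy_naive(logits, target):
    """Build the full softmax vector, then take -log of the target probability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]  # extra vocabulary-sized buffer
    return -math.log(probs[target])

def cross_entropy_streaming(logits, target):
    """One pass over the logits with a running max and a rescaled running sum."""
    running_max = float("-inf")
    running_sum = 0.0
    for x in logits:
        if x > running_max:
            # Rescale the sum accumulated so far to the new maximum.
            running_sum = running_sum * math.exp(running_max - x) + 1.0
            running_max = x
        else:
            running_sum += math.exp(x - running_max)
    log_sum_exp = running_max + math.log(running_sum)
    return log_sum_exp - logits[target]

example_logits = [2.0, 1.0, 0.1]  # illustrative values only
loss_naive = cross_entropy_naive(example_logits, target=0)
loss_streaming = cross_entropy_streaming(example_logits, target=0)
print(loss_naive, loss_streaming)
```

Both functions return the same loss up to floating-point error. The second one needs only two scalars of working state regardless of vocabulary size, which is the kind of saving a fused GPU kernel is after.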
Installation
pip install unsloth
For a specific CUDA version, or if the default install fails:
# For CUDA 12.1 on Ampere or newer GPUs, with PyTorch 2.5
pip install "unsloth[cu121-ampere-torch250]"
# For CUDA 11.8, with PyTorch 2.5
pip install "unsloth[cu118-torch250]"
Unsloth also works in Google Colab and Kaggle notebooks with no special configuration.
Supported Models
Unsloth supports the most popular open-weight model families:
- Llama 3 / 3.1 / 3.2 (1B, 3B, 8B, 70B)
- Mistral / Mixtral (7B, 8x7B)
- Qwen 2 / 2.5 (0.5B to 72B)
- Gemma 2 (2B, 9B, 27B)
- Phi-3 / Phi-4 (3.8B, 14B)
Unsloth provides pre-quantized 4-bit versions of these models that load faster than quantizing from scratch:
# Unsloth's pre-quantized models (recommended)
"unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"
"unsloth/Mistral-7B-Instruct-v0.3-bnb-4bit"
"unsloth/Qwen2.5-7B-Instruct-bnb-4bit"
"unsloth/gemma-2-9b-it-bnb-4bit"
FastLanguageModel: Loading and Configuration
Unsloth replaces the standard HF model loading with FastLanguageModel:
from unsloth import FastLanguageModel
import torch
# Load model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=2048,
    dtype=None,         # Auto-detect (bfloat16 on Ampere+, float16 otherwise)
    load_in_4bit=True,  # QLoRA
)
This single call handles everything: model loading, quantization, tokenizer setup, and memory optimization.
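A quick back-of-the-envelope calculation shows why 4-bit loading matters on consumer GPUs. The sketch below assumes roughly 8.03 billion parameters for Llama 3.1 8B and counts weight storage only. Real usage is higher, because quantization constants, any layers kept in higher precision, activations, and optimizer state are all ignored here.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate weight storage in GiB, ignoring all runtime overhead."""
    return num_params * bits_per_param / 8 / 2**30

LLAMA_31_8B_PARAMS = 8.03e9  # assumed approximate parameter count

mem_16bit = weight_memory_gib(LLAMA_31_8B_PARAMS, 16)
mem_4bit = weight_memory_gib(LLAMA_31_8B_PARAMS, 4)

print(f"16-bit weights: {mem_16bit:.1f} GiB")  # 15.0 GiB
print(f" 4-bit weights: {mem_4bit:.1f} GiB")   # 3.7 GiB
```

Treat the 4-bit figure as a lower bound. It still explains why an 8B model that would not fit on a 16GB card in 16-bit precision leaves room for training once quantized.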
Applying LoRA
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.0,  # Unsloth recommends 0 for speed
    bias="none",
    use_gradient_checkpointing="unsloth",  # Unsloth's optimized version
    random_state=42,
)
Key difference: Unsloth recommends lora_dropout=0.0 because its optimized kernels are fastest without dropout. If you need regularization, use a smaller learning rate or early stopping instead.
The use_gradient_checkpointing="unsloth" flag enables Unsloth's custom gradient checkpointing. Unsloth reports that it uses about 30% less VRAM than the standard implementation, which leaves room for longer sequences or larger batches.
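It helps to know how many parameters this configuration actually trains. Each LoRA adapter adds two matrices of shape (in_features x r) and (r x out_features), so the count per module is r * (in_features + out_features). The sketch below assumes the published Llama 3.1 8B shapes: hidden size 4096, MLP size 14336, 8 key/value heads of 128 dimensions, and 32 layers. The helper names are hypothetical.

```python
# (in_features, out_features) of each targeted projection in one decoder layer.
# Assumed Llama 3.1 8B shapes; adjust them for other models.
LLAMA_31_8B_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),   # 8 KV heads x 128 dims (grouped-query attention)
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32

def lora_trainable_params(shapes, rank, num_layers):
    """Sum r * (in + out) over every adapted module in every layer."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers

def lora_scaling(lora_alpha, rank):
    """Factor applied to the adapter output before it is added to the base output."""
    return lora_alpha / rank

trainable = lora_trainable_params(LLAMA_31_8B_SHAPES, rank=16, num_layers=NUM_LAYERS)
print(f"{trainable:,} trainable parameters")  # 41,943,040 trainable parameters
print(lora_scaling(lora_alpha=32, rank=16))   # 2.0
```

Under these assumptions, about 42 million parameters are trained, roughly 0.5% of the model. That is why optimizer state and gradients stay small even though all seven projection types are adapted.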
Training Configuration
Unsloth uses the standard Hugging Face SFTTrainer, so the training code is nearly identical:
from unsloth import is_bfloat16_supported
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
# Load dataset
dataset = load_dataset("json", data_files={
    "train": "data/train.jsonl",
    "validation": "data/val.jsonl",
})
training_args = TrainingArguments(
    output_dir="./output/llama3-unsloth",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,  # effective batch size: 2 * 8 = 16
    learning_rate=2e-4,
    lr_scheduler_type="linear",
    warmup_steps=10,
    optim="adamw_8bit",
    weight_decay=0.01,
    fp16=not is_bfloat16_supported(),  # float16 on pre-Ampere GPUs such as the T4
    bf16=is_bfloat16_supported(),      # bfloat16 on Ampere and newer
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=100,
    save_total_limit=3,  # keep only the three most recent checkpoints
    seed=42,
)
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    max_seq_length=2048,
    packing=False,
    dataset_text_field=None,
)
# Train
trainer.train()
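The batch and step settings interact, so it is worth working out the schedule before starting a run. The sketch below uses a hypothetical dataset of 10,000 examples and a hypothetical helper name. It approximates the optimizer step count with a ceiling division; the Trainer's exact rounding of a partial final accumulation window can differ between library versions.

```python
import math

def training_schedule(num_examples, per_device_batch_size, grad_accum_steps,
                      num_epochs, num_devices=1):
    """Return (effective batch size, optimizer steps per epoch, total steps)."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch, steps_per_epoch * num_epochs

# 10,000 examples is an assumed dataset size, not a value from this lesson.
effective_batch, steps_per_epoch, total_steps = training_schedule(
    num_examples=10_000,
    per_device_batch_size=2,
    grad_accum_steps=8,
    num_epochs=3,
)
print(effective_batch, steps_per_epoch, total_steps)  # 16 625 1875

num_evals = total_steps // 50         # eval_steps=50
num_checkpoints = total_steps // 100  # save_steps=100
print(num_evals, num_checkpoints)     # 37 18
```

For this assumed dataset, the 10 warmup steps cover about 0.5% of training, and save_total_limit=3 means only the last three of the 18 checkpoints remain on disk.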
Complete Example: Fine-Tuning Llama 3 on Custom Data
Here is a complete, self-contained script:
"""
Complete Unsloth fine-tuning script for Llama 3.1 8B.
Requires: pip install unsloth trl datasets
Hardware: Works on 16GB VRAM (T4, RTX 4060 Ti 16GB, etc.)
"""
from unsloth import FastLanguageModel, is_bfloat16_supported
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
import torch
# === 1. Load Model ===
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)
# === 2. Apply LoRA ===
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
)
# === 3. Load Dataset ===
dataset = load_dataset("json", data_files={
    "train": "data/train.jsonl",
    "validation": "data/val.jsonl",
})
# === 4. Training ===
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=TrainingArguments(
        output_dir="./output/llama3-custom",
        num_train_epochs=3,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        lr_scheduler_type="linear",
        warmup_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        fp16=not is_bfloat16_supported(),  # float16 on pre-Ampere GPUs such as the T4
        bf16=is_bfloat16_supported(),      # bfloat16 on Ampere and newer
        logging_steps=10,
        eval_strategy="steps",
        eval_steps=50,
        save_strategy="steps",
        save_steps=100,
        save_total_limit=3,
        seed=42,
    ),
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    max_seq_length=2048,
)
trainer.train()
# === 5. Save LoRA Adapter ===
model.save_pretrained("./output/llama3-custom/final")
tokenizer.save_pretrained("./output/llama3-custom/final")
# === 6. Test the Model ===
FastLanguageModel.for_inference(model)  # Enable fast inference mode
messages = [
    {"role": "user", "content": "Explain what fine-tuning is."}
]
inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to("cuda")
outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=256,
    temperature=0.7,
    do_sample=True,
)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
Comparison: Unsloth vs Vanilla Hugging Face
Benchmarks on an RTX 3090 (24GB), Llama 3.1 8B, QLoRA r=16, batch size 2:
| Metric | Hugging Face | Unsloth | Improvement |
|--------|-------------|---------|------------|
| Training speed | 1.2 it/s | 2.5 it/s | 2.1x faster |
| Peak VRAM | 14.2 GB | 8.1 GB | 43% less |
| Total training time (1000 steps) | 14 min | 6.7 min | 2.1x faster |
| Final loss | 0.847 | 0.845 | Equivalent |
Quality, measured by final loss and human evaluation, is effectively the same. Unsloth does not cut corners on the math; it computes the same math more efficiently.
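The "Improvement" column follows directly from the raw measurements, so you can rerun the same arithmetic on your own benchmark numbers. This is a small sketch with hypothetical helper names; the inputs are the figures from the table.

```python
def speedup(baseline_its, optimized_its):
    """Throughput ratio, with both values in iterations per second."""
    return optimized_its / baseline_its

def memory_reduction_pct(baseline_gb, optimized_gb):
    """Percentage of peak VRAM saved relative to the baseline."""
    return (baseline_gb - optimized_gb) / baseline_gb * 100

def minutes_for_steps(num_steps, its_per_second):
    """Wall-clock minutes needed for num_steps at a constant throughput."""
    return num_steps / its_per_second / 60

print(f"{speedup(1.2, 2.5):.1f}x faster")                 # 2.1x faster
print(f"{memory_reduction_pct(14.2, 8.1):.0f}% less")     # 43% less
print(f"{minutes_for_steps(1000, 1.2):.0f} min baseline") # 14 min baseline
print(f"{minutes_for_steps(1000, 2.5):.1f} min Unsloth")  # 6.7 min Unsloth
```

The time estimate assumes constant throughput, so it leaves out evaluation passes and checkpoint writes.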
When to Use Unsloth
Use Unsloth when:
- You are training on consumer GPUs (16-24GB VRAM)
- You want to minimize cloud GPU costs
- You are doing rapid iteration and need fast turnaround
- You are training on Colab/Kaggle free tier
Use vanilla Hugging Face when:
- You need multi-GPU training (Unsloth's multi-GPU support is more limited)
- You need custom training loops that modify the training step
- You are using a model architecture Unsloth does not support yet
- You need DeepSpeed or FSDP integration
Saving and Exporting
Unsloth makes exporting easy:
# Save as LoRA adapter (smallest, for later merging)
model.save_pretrained("output/lora-adapter")
# Save merged model in float16
model.save_pretrained_merged("output/merged-f16", tokenizer, save_method="merged_16bit")
# Save as GGUF for Ollama/llama.cpp
model.save_pretrained_gguf("output/gguf", tokenizer, quantization_method="q4_k_m")
# Push to Hugging Face Hub
model.push_to_hub_merged("your-username/your-model", tokenizer)
model.push_to_hub_gguf("your-username/your-model-gguf", tokenizer, quantization_method="q4_k_m")
This is one of Unsloth's biggest conveniences: you can convert to GGUF directly from the training script, without a separate conversion step.
In the next lesson, we cover how to monitor your training runs and debug the most common issues that arise during fine-tuning.