Training with Hugging Face
The Hugging Face Training Stack
Hugging Face provides one of the most mature and widely used ecosystems for fine-tuning LLMs. The stack consists of four main libraries: transformers for model loading, peft for LoRA/QLoRA, trl for training (SFTTrainer, DPOTrainer), and bitsandbytes for quantization. This lesson walks through a complete training pipeline and explains each parameter along the way.
Setup and Dependencies
pip install torch transformers peft trl datasets bitsandbytes accelerate
For Flash Attention 2 support (recommended for speed):
pip install flash-attn --no-build-isolation
Loading the Base Model with QLoRA
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Model identifier
model_name = "meta-llama/Llama-3.1-8B-Instruct"

# 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # Dtype used for computation
    bnb_4bit_use_double_quant=True,         # Also quantize the quantization constants
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# Load model with quantization
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="flash_attention_2",  # Requires flash-attn; use "sdpa" if it is not installed
    torch_dtype=torch.bfloat16,
)

# Prepare model for QLoRA training
model = prepare_model_for_kbit_training(model)
model.config.use_cache = False  # Disable KV cache during training
Key details:
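- pad_token = eos_token gives the tokenizer a padding token, which prevents errors when batching sequences of different lengths.
- padding_side = "right" is the usual choice when training causal language models; batched generation typically uses left padding instead.
- use_cache = False disables the KV cache, which conflicts with gradient checkpointing.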
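To get a feel for why 4-bit loading matters, the weight memory can be estimated with simple arithmetic. The sketch below is a rough estimate rather than a measurement: the parameter count is an assumed round figure for an 8B model, the function name is illustrative, and the calculation counts weights only. Real usage is higher because some layers stay unquantized and because quantization constants, activations, and optimizer state also need memory.

```python
def weight_memory_gib(num_params: int, bits_per_param: float) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 8_030_000_000  # assumption: approximate parameter count of an 8B model

bf16_gib = weight_memory_gib(NUM_PARAMS, 16)  # 16 bits per weight
nf4_gib = weight_memory_gib(NUM_PARAMS, 4)    # 4 bits per weight

print(f"bf16 weights:  ~{bf16_gib:.1f} GiB")
print(f"4-bit weights: ~{nf4_gib:.1f} GiB")
```

The fourfold reduction in weight memory is what makes it plausible to fine-tune a model of this size on a single consumer GPU.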
Configuring LoRA
from peft import LoraConfig, TaskType
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
# Apply LoRA to the model
model = get_peft_model(model, lora_config)
# Print trainable parameters
model.print_trainable_parameters()
# Output (approximately, for r=16 on this model): trainable params: 41,943,040 || all params: 8,072,204,288 || trainable%: 0.5196
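The trainable parameter count can be checked by hand: each adapted linear layer gains two low-rank matrices with r * (d_in + d_out) parameters in total. The sketch below assumes the layer shapes of Llama 3.1 8B (hidden size 4096, key/value projection width 1024, MLP width 14336, 32 layers); the function and dictionary names are illustrative and not part of peft.

```python
def lora_trainable_params(shapes: dict, r: int, num_layers: int) -> int:
    """Count LoRA parameters: each adapted layer adds A (r x d_in) and B (d_out x r)."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


# Assumed (d_in, d_out) shapes per transformer layer for Llama 3.1 8B
LLAMA_8B_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}

for rank in (8, 16):
    count = lora_trainable_params(LLAMA_8B_SHAPES, r=rank, num_layers=32)
    print(f"r={rank}: {count:,} trainable parameters")
```

Under these assumptions, r=8 gives 20,971,520 parameters and r=16 gives 41,943,040, so doubling the rank doubles the adapter size.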
Loading the Dataset
from datasets import load_dataset
# Load from local JSONL files
dataset = load_dataset("json", data_files={
    "train": "data/train.jsonl",
    "validation": "data/val.jsonl",
})
# Or load from Hugging Face Hub
# dataset = load_dataset("your-username/your-dataset")
print(f"Train: {len(dataset['train'])} examples")
print(f"Validation: {len(dataset['validation'])} examples")
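It can help to validate the JSONL files before training. The sketch below assumes the conversational layout, where each line is a JSON object with a "messages" list of role/content pairs. The helper name, the allowed roles, and the temporary file are illustrative choices, so adapt them to your own data.

```python
import json
import tempfile
from pathlib import Path

VALID_ROLES = {"system", "user", "assistant"}  # assumption: roles used in your data


def validate_record(record: dict) -> None:
    """Raise ValueError if a record does not look like a chat example."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        raise ValueError("record needs a non-empty 'messages' list")
    for message in messages:
        if message.get("role") not in VALID_ROLES:
            raise ValueError(f"unexpected role: {message.get('role')!r}")
        if not isinstance(message.get("content"), str):
            raise ValueError("each message needs string 'content'")


example = {"messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is LoRA?"},
    {"role": "assistant", "content": "LoRA trains small low-rank adapter matrices."},
]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "train.jsonl"
    # JSONL: one JSON object per line
    path.write_text(json.dumps(example) + "\n", encoding="utf-8")
    records = [json.loads(line) for line in path.read_text(encoding="utf-8").splitlines()]

for record in records:
    validate_record(record)
print(f"validated {len(records)} record(s)")
```

Running a check like this over the real training and validation files catches malformed lines before they surface as errors partway through a training run.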
SFTTrainer: Supervised Fine-Tuning
The SFTTrainer from trl handles the training loop, data formatting, and evaluation:
from trl import SFTConfig, SFTTrainer

# SFTConfig extends TrainingArguments with SFT-specific options
training_args = SFTConfig(
    output_dir="./output/llama3-finetuned",
    # Training duration
    num_train_epochs=3,              # 2-5 epochs typical
    max_steps=-1,                    # -1 means use num_train_epochs
    # Batch size
    per_device_train_batch_size=2,   # Adjust based on VRAM
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,   # Effective batch size: 2 * 8 = 16
    # Learning rate
    learning_rate=2e-4,              # 1e-4 to 2e-4 for QLoRA
    lr_scheduler_type="cosine",      # Cosine decay
    warmup_ratio=0.05,               # 5% warmup steps
    # Optimization
    optim="paged_adamw_8bit",        # Memory-efficient optimizer
    weight_decay=0.01,
    max_grad_norm=1.0,               # Gradient clipping
    # Precision
    bf16=True,                       # Use bfloat16
    # Logging
    logging_steps=10,
    logging_first_step=True,
    # Evaluation
    eval_strategy="steps",
    eval_steps=50,
    # Saving
    save_strategy="steps",
    save_steps=100,
    save_total_limit=3,              # Keep only the 3 most recent checkpoints
    # Other
    gradient_checkpointing=True,     # Trade compute for memory
    gradient_checkpointing_kwargs={"use_reentrant": False},
    report_to="tensorboard",         # Or "wandb"
    seed=42,
    # Sequence handling (set on the config in recent TRL releases)
    max_length=2048,                 # Named max_seq_length in older TRL releases
    packing=False,                   # Set True for short examples
)

# A dataset with a "messages" column is treated as conversational and
# formatted with the tokenizer's chat template
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    processing_class=tokenizer,
)
Understanding Key Training Arguments
Learning rate: The most impactful hyperparameter. For QLoRA with rank 16:
- Start with 2e-4 for small datasets (under 1,000 examples)
- Use 1e-4 for larger datasets (over 5,000 examples)
- If training loss oscillates wildly, reduce it
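Batch size and gradient accumulation: The effective batch size is per_device_train_batch_size * gradient_accumulation_steps * num_gpus. Larger effective batch sizes produce more stable training. Raising the per-device batch size costs memory, whereas raising gradient accumulation costs time instead, so use accumulation when VRAM is the limit. Start with an effective batch size of 16.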
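The arithmetic is easy to sanity-check before launching a run. The sketch below uses illustrative helper names and an assumed dataset size of 1,000 examples. The step count is an approximation, since the trainer's own rounding of partial batches may differ by a step.

```python
import math


def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int = 1) -> int:
    """Number of examples that contribute to each optimizer update."""
    return per_device * grad_accum * num_gpus


def approx_optimizer_steps(num_examples: int, batch: int, epochs: int) -> int:
    """Approximate number of optimizer updates over the whole run."""
    return math.ceil(num_examples / batch) * epochs


batch = effective_batch_size(per_device=2, grad_accum=8, num_gpus=1)
steps = approx_optimizer_steps(num_examples=1000, batch=batch, epochs=3)

print(f"effective batch size: {batch}")  # 16
print(f"optimizer steps: {steps}")       # 189
```

Knowing the total step count up front also makes it easier to pick sensible values for logging, evaluation, and checkpoint intervals.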
Epochs: For small datasets (200-500 examples), use 3-5 epochs. For larger datasets (5,000+), 1-2 epochs is often sufficient. Watch validation loss to detect overfitting.
Warmup: Gradually increases the learning rate at the start of training. 5% of total steps is a safe default. Helps prevent early instability.
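To visualize how warmup and cosine decay combine, the schedule can be written as a small function. This is a sketch of the general shape (linear warmup followed by a half-cosine decay to zero), not the library's implementation. The function name and the 200-step run are assumptions for illustration.

```python
import math


def lr_at_step(step: int, total_steps: int, base_lr: float = 2e-4,
               warmup_ratio: float = 0.05) -> float:
    """Learning rate under linear warmup followed by cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, from 0.0 to 1.0
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 200  # assumption: a short run, so warmup lasts 10 steps
for step in (0, 5, 10, 105, 200):
    print(f"step {step:>3}: lr = {lr_at_step(step, TOTAL_STEPS):.2e}")
```

The rate climbs to its peak at the end of warmup, passes through half the peak at the midpoint of the decay phase, and reaches zero on the final step.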
Running Training
# Start training
trainer.train()
# Save the final LoRA adapter
trainer.save_model("./output/llama3-finetuned/final")
tokenizer.save_pretrained("./output/llama3-finetuned/final")
DPO Training: Preference Alignment
Direct Preference Optimization (DPO) trains the model on pairs of preferred and rejected responses to the same prompt. It is typically applied after SFT to further align the model's behavior.
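A preference dataset is a set of such pairs, one per line. The sketch below shows one record in the plain-string layout with prompt, chosen, and rejected fields, plus a small sanity check. The helper name and the example text are illustrative, and TRL also accepts other layouts that are not covered here.

```python
import json

REQUIRED_KEYS = ("prompt", "chosen", "rejected")


def check_preference_record(record: dict) -> bool:
    """True if all fields are non-empty strings and the two responses differ."""
    has_keys = all(
        isinstance(record.get(key), str) and bool(record[key])
        for key in REQUIRED_KEYS
    )
    return has_keys and record["chosen"] != record["rejected"]


record = {
    "prompt": "Explain what LoRA is in simple terms.",
    "chosen": "LoRA fine-tunes a model by training small adapter matrices.",
    "rejected": "LoRA is a kind of radio.",
}

line = json.dumps(record)  # one line of a JSONL preference file
print(line)
print("valid:", check_preference_record(json.loads(line)))
```

Identical chosen and rejected responses carry no preference signal, which is why the check rejects them.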
from trl import DPOTrainer, DPOConfig

# Load your preference dataset
# Each example needs: prompt, chosen, rejected
# A single JSONL file loads as one "train" split, so hold out 10% for evaluation
dpo_dataset = load_dataset(
    "json", data_files="data/preferences.jsonl", split="train"
).train_test_split(test_size=0.1, seed=42)

dpo_config = DPOConfig(
    output_dir="./output/llama3-dpo",
    num_train_epochs=1,              # DPO typically needs fewer epochs
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,              # Lower LR than SFT
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    beta=0.1,                        # DPO temperature parameter
    bf16=True,
    optim="paged_adamw_8bit",
    gradient_checkpointing=True,
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=50,
    report_to="tensorboard",
)

# With a PEFT model and no explicit ref_model, TRL uses the model with its
# adapters disabled as the reference policy
dpo_trainer = DPOTrainer(
    model=model,
    args=dpo_config,
    train_dataset=dpo_dataset["train"],
    eval_dataset=dpo_dataset["test"],
    processing_class=tokenizer,
)
dpo_trainer.train()
DPO beta parameter: Controls how much the model deviates from the reference policy. Lower beta (0.05) means more deviation, higher beta (0.5) keeps the model closer to its SFT behavior. Start with 0.1.
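The role of beta is easier to see in the loss itself. The sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. The function name and the numbers are made up for illustration. A real trainer computes these log-probabilities from the policy and reference models and averages over a batch.

```python
import math


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """DPO loss for one pair, given log-probabilities of each full response."""
    # How much more the policy prefers "chosen" over "rejected" than the reference does
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(x)) equals log(1 + exp(-x))
    return math.log1p(math.exp(-beta * margin))


# Policy identical to the reference: margin is 0, so the loss is log(2)
print(f"no change:  {dpo_loss(-10.0, -12.0, -10.0, -12.0):.4f}")

# Policy has shifted toward the chosen response: the loss drops below log(2)
for beta in (0.05, 0.1, 0.5):
    print(f"beta={beta}: {dpo_loss(-8.0, -14.0, -10.0, -12.0, beta=beta):.4f}")
```

Because beta multiplies the margin before the sigmoid, it sets how strongly a given shift away from the reference model is reflected in the loss.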
Testing Your Fine-Tuned Model
from peft import PeftModel

# Load the fine-tuned model for inference
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
model = PeftModel.from_pretrained(base_model, "./output/llama3-finetuned/final")
model.eval()

# Generate a response
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain what LoRA is in simple terms."},
]
# add_generation_prompt=True appends the assistant header so the model
# answers instead of continuing the user turn
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(
    inputs,
    max_new_tokens=256,
    temperature=0.7,
    do_sample=True,
)
# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
print(response)
Practical Tips
- Always monitor GPU memory during the first few training steps. Use nvidia-smi or torch.cuda.memory_summary().
- Start with a small subset of your data (50-100 examples) to verify the pipeline works before training on the full dataset.
- Save checkpoints frequently early in training. If something goes wrong at step 500, you do not want to restart from scratch.
- Check a few model outputs at each checkpoint. Loss numbers alone do not tell you if the model is learning the right behavior.
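For the small-subset tip, a reproducible sample is enough for a smoke test. The sketch below works on a list of JSONL lines using only the standard library. The helper name and the stand-in data are hypothetical; in practice the lines would come from your training file.

```python
import random


def smoke_test_subset(lines: list, n: int = 100, seed: int = 42) -> list:
    """Return up to n lines chosen reproducibly from a list of JSONL lines."""
    if n >= len(lines):
        return list(lines)
    # A seeded generator makes the same subset come back on every run
    return random.Random(seed).sample(lines, n)


# Stand-in for the contents of a training file (hypothetical data)
all_lines = [f'{{"id": {i}}}' for i in range(1000)]

subset = smoke_test_subset(all_lines, n=100)
print(f"{len(subset)} of {len(all_lines)} examples selected")
```

Fixing the seed keeps the subset stable across runs, so a failure seen once can be reproduced on the same examples.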
In the next lesson, we will cover Unsloth, an alternative training framework that reports roughly 2x faster training and 60% lower memory use compared with vanilla Hugging Face training.