Training with Hugging Face — Fine-Tuning LLMs: From Data to Deployment

The Hugging Face Training Stack

Hugging Face provides a mature, widely used ecosystem for fine-tuning LLMs. The stack consists of four main libraries: transformers for model loading, peft for LoRA/QLoRA, trl for training (SFTTrainer, DPOTrainer), and bitsandbytes for quantization. This lesson walks through a complete training pipeline and explains each parameter along the way.

Setup and Dependencies

pip install torch transformers peft trl datasets bitsandbytes accelerate

For Flash Attention 2 support (optional but recommended for speed; requires an NVIDIA GPU of the Ampere generation or newer):

pip install flash-attn --no-build-isolation
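
Before going further, it can help to confirm what actually got installed, since several APIs used in this lesson differ between releases. The snippet below is a minimal sketch using only the standard library; the helper name installed_versions is made up for this example.

```python
from importlib import metadata

# Packages installed by the pip command above.
PACKAGES = [
    "torch", "transformers", "peft", "trl",
    "datasets", "bitsandbytes", "accelerate",
]


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name:>14}: {version or 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED needs to be fixed before the later steps will run.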

Loading the Base Model with QLoRA

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Model identifier
model_name = "meta-llama/Llama-3.1-8B-Instruct"

# 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# Load model with quantization
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="flash_attention_2",  # Requires flash-attn; use "sdpa" otherwise
    torch_dtype=torch.bfloat16,
)

# Prepare model for QLoRA training
model = prepare_model_for_kbit_training(model)
model.config.use_cache = False  # Disable KV cache during training

Key details:

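pad_token = eos_token gives the tokenizer a padding token. Llama tokenizers ship without one, so batching sequences of different lengths would otherwise raise an error.
padding_side = "right" is the usual choice when training causal language models. For batched generation, switch to "left" so that new tokens follow the prompt directly.
use_cache = False disables the KV cache, which only helps during generation and is incompatible with gradient checkpointing.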
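
To get a feel for what the 4-bit settings in bnb_config save, here is a rough back-of-the-envelope estimate. It is a sketch under stated assumptions: a block size of 64 weights per quantization constant, and, with double quantization, 8-bit constants grouped in blocks of 256 with one 32-bit constant per group. Those block sizes are assumptions based on the QLoRA scheme, not values read from the code above. The estimate also treats every parameter as quantized, although embeddings and norms usually stay in higher precision.

```python
def bits_per_param(block_size=64, double_quant=True, second_block_size=256):
    """Approximate storage cost of one NF4-quantized weight, in bits."""
    weight_bits = 4.0
    if double_quant:
        # One 8-bit constant per block, plus one fp32 constant shared by
        # every group of `second_block_size` blocks (assumed sizes).
        constant_bits = 8 / block_size + 32 / (block_size * second_block_size)
    else:
        # One fp32 constant per block.
        constant_bits = 32 / block_size
    return weight_bits + constant_bits


def quantized_weight_gib(num_params, **kwargs):
    """Approximate size of the quantized weights in GiB."""
    return num_params * bits_per_param(**kwargs) / 8 / 1024**3


print(f"bits/param without double quant: {bits_per_param(double_quant=False):.3f}")
print(f"bits/param with double quant:    {bits_per_param():.3f}")
print(f"8B weights, double quant:        {quantized_weight_gib(8e9):.2f} GiB")
```

Activations, LoRA weights, gradients, and optimizer state come on top of this, so treat the number as a lower bound on VRAM use.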

Configuring LoRA

from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

# Apply LoRA to the model
model = get_peft_model(model, lora_config)

# Print trainable parameters
model.print_trainable_parameters()
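# Example output (r=16 on all seven projections; the reported total can vary with quantization and PEFT version):
# trainable params: 41,943,040 || all params: 8,072,204,288 || trainable%: 0.5196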
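
The trainable-parameter count can be checked by hand. Each LoRA adapter adds two matrices to a linear layer, of shapes (in_features, r) and (r, out_features), so it contributes r * (in_features + out_features) parameters. The sketch below assumes the published Llama 3.1 8B shapes (hidden size 4096, MLP size 14336, key/value projection width 1024, 32 layers); the constant and function names are made up for this example.

```python
# Assumed Llama 3.1 8B dimensions (taken from the model's published config).
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 1024          # 8 key/value heads x 128 head dimension
NUM_LAYERS = 32

# (in_features, out_features) of each targeted linear layer.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(r, shapes=MODULE_SHAPES, num_layers=NUM_LAYERS):
    """Trainable parameters added by rank-r LoRA on the given modules."""
    per_layer = sum(r * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers


print(f"r=16: {lora_param_count(16):,}")  # r=16: 41,943,040
print(f"r=8:  {lora_param_count(8):,}")   # r=8:  20,971,520
```

The count scales linearly with r, so doubling the rank doubles the adapter size.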

Loading the Dataset

from datasets import load_dataset

# Load from local JSONL files
dataset = load_dataset("json", data_files={
    "train": "data/train.jsonl",
    "validation": "data/val.jsonl",
})

# Or load from Hugging Face Hub
# dataset = load_dataset("your-username/your-dataset")

print(f"Train: {len(dataset['train'])} examples")
print(f"Validation: {len(dataset['validation'])} examples")
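
For reference, here is a sketch of what one line of train.jsonl might look like, assuming the conversational layout: a messages list of role/content turns. The example conversation and the write_jsonl and read_jsonl helpers are illustrative, not part of any library.

```python
import json
import os
import tempfile

# One training example in the assumed 'messages' layout.
examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What does LoRA stand for?"},
            {"role": "assistant", "content": "Low-Rank Adaptation."},
        ]
    },
]


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Written to a temporary directory here; in the lesson this would be data/train.jsonl.
train_path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
write_jsonl(train_path, examples)
print(read_jsonl(train_path)[0]["messages"][-1]["content"])
```

Reading the file back before training is a cheap way to catch malformed lines early.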

SFTTrainer: Supervised Fine-Tuning

The SFTTrainer from trl handles the training loop, data formatting, and evaluation:

from trl import SFTConfig, SFTTrainer

# SFTConfig extends TrainingArguments with SFT-specific options
training_args = SFTConfig(
    output_dir="./output/llama3-finetuned",

    # Sequence handling (SFT-specific)
    max_seq_length=2048,                 # Named max_length in newer TRL releases
    packing=False,                       # Set True for short examples

    # Training duration
    num_train_epochs=3,                  # 2-5 epochs typical
    max_steps=-1,                        # -1 means use num_train_epochs

    # Batch size
    per_device_train_batch_size=2,       # Adjust based on VRAM
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,       # Effective batch size: 2 * 8 = 16 per GPU

    # Learning rate
    learning_rate=2e-4,                  # 1e-4 to 2e-4 for QLoRA
    lr_scheduler_type="cosine",          # Cosine decay
    warmup_ratio=0.05,                   # 5% warmup steps

    # Optimization
    optim="paged_adamw_8bit",            # Memory-efficient optimizer
    weight_decay=0.01,
    max_grad_norm=1.0,                   # Gradient clipping

    # Precision
    bf16=True,                           # Use bfloat16

    # Logging
    logging_steps=10,
    logging_first_step=True,

    # Evaluation
    eval_strategy="steps",
    eval_steps=50,

    # Saving
    save_strategy="steps",
    save_steps=100,
    save_total_limit=3,                  # Keep only the 3 most recent checkpoints

    # Other
    gradient_checkpointing=True,         # Trade compute for memory
    gradient_checkpointing_kwargs={"use_reentrant": False},
    report_to="tensorboard",             # Or "wandb"
    seed=42,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],      # Rows in 'messages' format are detected automatically
    eval_dataset=dataset["validation"],
    processing_class=tokenizer,
)

Understanding Key Training Arguments

Learning rate: The most impactful hyperparameter. For QLoRA with rank 16:

Start with 2e-4 for small datasets (under 1,000 examples)
Use 1e-4 for larger datasets (over 5,000 examples)
If training loss oscillates wildly, reduce it

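Batch size and gradient accumulation: The effective batch size is per_device_train_batch_size * gradient_accumulation_steps * num_gpus. Larger effective batch sizes produce more stable training. Raising per_device_train_batch_size costs more memory, while raising gradient_accumulation_steps keeps memory roughly constant at the cost of fewer optimizer updates per epoch. Start with an effective batch size of 16.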
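
The arithmetic behind the effective batch size also tells you how many optimizer steps a run will take. The sketch below is an approximation: the exact rounding the trainer applies can differ by a step or two between library versions, and the function name is made up for this example.

```python
import math


def training_plan(num_examples, per_device_batch, grad_accum, num_gpus, epochs):
    """Approximate step counts for a run (rounding may differ slightly in practice)."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * epochs,
    }


# The settings used in this lesson, with a hypothetical 1,000-example dataset.
plan = training_plan(num_examples=1000, per_device_batch=2,
                     grad_accum=8, num_gpus=1, epochs=3)
print(plan)
# {'effective_batch': 16, 'steps_per_epoch': 63, 'total_steps': 189}
```

Knowing the total step count up front makes it easier to pick sensible values for eval_steps and save_steps.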

Epochs: For small datasets (200-500 examples), use 3-5 epochs. For larger datasets (5,000+), 1-2 epochs is often sufficient. Watch validation loss to detect overfitting.
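
One simple way to turn "watch validation loss" into a rule is a patience check: stop when the last few evaluations all fail to beat the best earlier loss. The function below is an illustrative sketch in plain Python, and its name and defaults are made up for this example. The transformers library also ships an EarlyStoppingCallback for use inside the trainer.

```python
def should_stop(eval_losses, patience=3, min_delta=0.0):
    """Return True if none of the last `patience` evaluations improved on the
    best loss seen before them by more than `min_delta`."""
    if len(eval_losses) <= patience:
        return False  # Not enough history to judge yet.
    best_before = min(eval_losses[:-patience])
    recent = eval_losses[-patience:]
    return all(loss > best_before - min_delta for loss in recent)


# Validation loss bottoms out at 0.70 and then climbs: a sign of overfitting.
print(should_stop([1.00, 0.80, 0.70, 0.72, 0.75, 0.80]))  # True
# Validation loss is still falling: keep training.
print(should_stop([1.00, 0.90, 0.80, 0.70, 0.60]))        # False
```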

Warmup: Gradually increases the learning rate at the start of training. 5% of total steps is a safe default. Helps prevent early instability.
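
To make the shape of the schedule concrete, the sketch below computes the learning rate at a given step for linear warmup followed by cosine decay to zero. It is a plain-Python illustration of the schedule's shape, not the library's own implementation, and the function and parameter names are made up for this example.

```python
import math


def lr_at_step(step, total_steps, warmup_steps, peak_lr=2e-4):
    """Learning rate with linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, from 0.0 to 1.0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical run: 200 steps in total, 10 of them warmup (5%).
for step in (0, 5, 10, 105, 200):
    print(step, f"{lr_at_step(step, total_steps=200, warmup_steps=10):.2e}")
```

The rate peaks exactly when warmup ends, passes half the peak midway through the decay phase, and reaches zero at the final step.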

Running Training

# Start training
trainer.train()

# Save the final LoRA adapter
trainer.save_model("./output/llama3-finetuned/final")
tokenizer.save_pretrained("./output/llama3-finetuned/final")

DPO Training: Preference Alignment

Direct Preference Optimization (DPO) trains the model on pairs of preferred and rejected responses. This is used after SFT to further align the model's behavior.
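
A preference record pairs one prompt with a preferred (chosen) and a dispreferred (rejected) response. Because a malformed record can derail a run, a quick sanity check before training is worthwhile. The sketch below assumes plain-string fields named prompt, chosen, and rejected; the validator itself is hypothetical and would reject conversational (list-of-messages) records.

```python
REQUIRED_FIELDS = ("prompt", "chosen", "rejected")


def validate_preference_record(record):
    """Return a list of problems found in one record (empty if it looks fine)."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    # A pair with identical responses carries no preference signal.
    if not problems and record["chosen"] == record["rejected"]:
        problems.append("chosen and rejected are identical")
    return problems


good = {
    "prompt": "Summarize LoRA in one sentence.",
    "chosen": "LoRA fine-tunes a model by training small low-rank adapter matrices.",
    "rejected": "LoRA is a thing.",
}
bad = {"prompt": "Summarize LoRA in one sentence.", "chosen": "Same.", "rejected": "Same."}

print(validate_preference_record(good))  # []
print(validate_preference_record(bad))   # ['chosen and rejected are identical']
```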

from trl import DPOTrainer, DPOConfig

# Load your preference dataset
# Each example needs: prompt, chosen, rejected
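dpo_dataset = load_dataset("json", data_files="data/preferences.jsonl", split="train")

# A single JSONL file yields only a train split, so hold out 10% for evaluation
dpo_dataset = dpo_dataset.train_test_split(test_size=0.1, seed=42)
dpo_dataset["validation"] = dpo_dataset.pop("test")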

dpo_config = DPOConfig(
    output_dir="./output/llama3-dpo",
    num_train_epochs=1,                  # DPO typically needs fewer epochs
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,                  # Lower LR than SFT
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    beta=0.1,                            # DPO temperature parameter
    bf16=True,
    optim="paged_adamw_8bit",
    gradient_checkpointing=True,
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=50,
    report_to="tensorboard",
)

dpo_trainer = DPOTrainer(
    model=model,
    args=dpo_config,
    train_dataset=dpo_dataset["train"],
    eval_dataset=dpo_dataset["validation"],
    processing_class=tokenizer,
)

dpo_trainer.train()

DPO beta parameter: Controls how much the model deviates from the reference policy. Lower beta (0.05) means more deviation, higher beta (0.5) keeps the model closer to its SFT behavior. Start with 0.1.
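
The role of beta is easiest to see in the loss itself. For a single preference pair, the standard DPO loss is -log(sigmoid(beta * margin)), where margin is how much more the policy has raised the chosen response's log-probability than the rejected one's, relative to the reference model. The sketch below computes it in plain Python for scalar log-probabilities; it is an illustration of the formula, not the trainer's implementation, and the function name is made up for this example.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(margin)), written as softplus(-margin) for numerical stability.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: margin is 0, so the loss is log(2).
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))

# Same policy shift under two betas: the larger beta drives the loss down
# faster, so less deviation from the reference is needed to satisfy it.
print(dpo_loss(-8.0, -12.0, -10.0, -10.0, beta=0.05))
print(dpo_loss(-8.0, -12.0, -10.0, -10.0, beta=0.5))
```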

Testing Your Fine-Tuned Model

from peft import PeftModel

# Load the fine-tuned model for inference
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
model = PeftModel.from_pretrained(base_model, "./output/llama3-finetuned/final")

# Generate a response
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain what LoRA is in simple terms."}
]

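inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,          # Append the assistant header so the model answers
    return_tensors="pt",
    return_dict=True,
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    do_sample=True,
)

# Decode only the newly generated tokens, not the prompt
prompt_length = inputs["input_ids"].shape[1]
response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)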
print(response)

Practical Tips

Always monitor GPU memory during the first few training steps. Use nvidia-smi or torch.cuda.memory_summary().
Start with a small subset of your data (50-100 examples) to verify the pipeline works before training on the full dataset.
Save checkpoints frequently early in training. If something goes wrong at step 500, you do not want to restart from scratch.
Check a few model outputs at each checkpoint. Loss numbers alone do not tell you if the model is learning the right behavior.
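
If a run does die partway through, the newest checkpoint in the output directory is the place to resume from. The sketch below finds it, assuming checkpoints are saved in subdirectories named checkpoint-<step>; the helper name is made up for this example, and the demo uses a temporary directory in place of a real output folder.

```python
import os
import re
import tempfile

_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir):
    """Return the path of the checkpoint-<step> folder with the highest step, or None."""
    if not os.path.isdir(output_dir):
        return None
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = _CHECKPOINT_RE.match(name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


# Demo with fake checkpoint folders.
demo_dir = tempfile.mkdtemp()
for step in (100, 200, 1000):
    os.makedirs(os.path.join(demo_dir, f"checkpoint-{step}"))

print(latest_checkpoint(demo_dir))  # Path ending in checkpoint-1000
```

The returned path can be passed to trainer.train(resume_from_checkpoint=path). Note that the step numbers are compared as integers, so checkpoint-1000 correctly ranks above checkpoint-200.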

In the next lesson, we will cover Unsloth, an alternative training framework that advertises roughly 2x faster training and 60% lower memory use than vanilla Hugging Face training.