Monitoring and Debugging Training
Why Monitoring Matters
A fine-tuning run can take anywhere from 30 minutes to several hours. Without proper monitoring, you may only discover problems after the run completes — wasting compute and time. Good monitoring lets you catch overfitting, detect data issues, and kill bad runs early. This lesson covers the metrics you should track, the patterns you should recognize, and the tools that make it all visible.
Essential Training Metrics
Training Loss
The primary metric. It measures how well the model predicts the next token on the training data. A healthy training loss curve shows a steep initial drop followed by a gradual decrease.
What to watch for:
- Steady decrease: Good. The model is learning.
- Plateau early: The learning rate may be too low, or the LoRA rank may be too low for the complexity of the task.
- Oscillation: The learning rate is too high or the batch size is too small.
- Sudden spike: Possible data corruption at that batch, or a numerical instability.
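The "sudden spike" pattern above can also be flagged programmatically rather than by eye. The following is a minimal sketch, not part of any library: `find_loss_spikes` is a hypothetical helper, and the window size and 1.5x threshold are arbitrary assumptions you would tune for your own run.

```python
from statistics import median

def find_loss_spikes(losses, window=5, factor=1.5):
    """Return indices where the loss exceeds `factor` times the median
    of the preceding `window` values (a crude spike heuristic)."""
    spikes = []
    for i in range(window, len(losses)):
        baseline = median(losses[i - window:i])
        if losses[i] > factor * baseline:
            spikes.append(i)
    return spikes

# Toy curve (made-up values): a smooth decline with one abnormal step
losses = [2.0, 1.8, 1.6, 1.5, 1.4, 1.3, 4.0, 1.2, 1.1]
print(find_loss_spikes(losses))
```

A flagged index tells you which logged step to inspect, for example by looking at the batch that was fed to the model at that point.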
Validation Loss
Measures performance on data the model has not seen during training. The gap between training loss and validation loss is your overfitting indicator.
Healthy pattern: Training loss and validation loss decrease together, with validation loss slightly higher.
Overfitting pattern: Training loss keeps decreasing while validation loss starts increasing. This means the model is memorizing training examples rather than learning generalizable patterns.
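from transformers import TrainingArguments

# In TrainingArguments, enable evaluation
training_args = TrainingArguments(
    eval_strategy="steps",
    eval_steps=50,                # Evaluate every 50 steps
    save_strategy="steps",        # Must match eval_strategy when loading the best model
    save_steps=50,                # Save a checkpoint at every evaluation
    load_best_model_at_end=True,  # Reload the best checkpoint
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)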
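Once evaluation is enabled, the overfitting pattern described above (training loss falling while validation loss rises) can be detected from the logged values. This is a rough sketch under the assumption that you have collected training and validation losses at the same evaluation steps; `first_divergence` and its `patience` parameter are hypothetical names, not library functions.

```python
def first_divergence(train_losses, eval_losses, patience=2):
    """Return the index of the first evaluation in a run of `patience`
    consecutive evaluations where eval loss rose while train loss fell,
    or None if that never happens."""
    rising = 0
    for i in range(1, len(eval_losses)):
        eval_up = eval_losses[i] > eval_losses[i - 1]
        train_down = train_losses[i] < train_losses[i - 1]
        rising = rising + 1 if (eval_up and train_down) else 0
        if rising >= patience:
            return i - patience + 1
    return None

# Toy values (made up): validation loss bottoms out at index 2
train = [2.0, 1.5, 1.2, 1.0, 0.8, 0.6]
evals = [2.1, 1.7, 1.5, 1.6, 1.7, 1.9]
print(first_divergence(train, evals))
```

Requiring several consecutive rises avoids reacting to a single noisy evaluation.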
Learning Rate Schedule
Track the learning rate to verify your scheduler is working correctly. A cosine schedule should show a smooth curve from peak to near-zero. A linear schedule should show a straight decline.
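To check a logged curve against what you expect, it helps to compute the expected value at a given step. The sketch below assumes the common shape of linear warmup followed by cosine or linear decay to zero; `lr_at_step` is a hypothetical helper, and the exact values your scheduler produces may differ slightly depending on its implementation and settings.

```python
import math

def lr_at_step(step, total_steps, peak_lr, warmup_steps=0, kind="cosine"):
    """Expected learning rate at `step` for linear warmup followed by
    cosine or linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if kind == "cosine":
        return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
    if kind == "linear":
        return peak_lr * max(0.0, 1.0 - progress)
    raise ValueError(f"unknown schedule: {kind}")

# Print a few points of a 100-step cosine schedule with 10 warmup steps
for step in (0, 10, 55, 100):
    print(step, lr_at_step(step, total_steps=100, peak_lr=2e-4, warmup_steps=10))
```

If the logged learning rate stays flat or never reaches the peak, the scheduler or the total step count is probably misconfigured.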
Gradient Norm
The magnitude of gradients tells you about training stability. Gradient norms that spike to very high values indicate instability. Gradient norms that collapse to near zero indicate vanishing gradients.
# Log metrics every 10 steps; recent Transformers versions include the gradient norm
training_args = TrainingArguments(
    logging_steps=10,
    logging_first_step=True,
    max_grad_norm=1.0,  # Clip gradients above this norm
)
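To make the `max_grad_norm` setting concrete, here is a small pure-Python sketch of what a global gradient norm is and how clipping by it rescales gradients. It illustrates the general technique on toy lists; it is not the library's implementation, and the function names are hypothetical.

```python
import math

def global_grad_norm(grads):
    """L2 norm over all gradient values, treating every parameter's
    gradient as part of one long vector."""
    return math.sqrt(sum(g * g for param in grads for g in param))

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients by the same factor so the global norm
    does not exceed `max_norm`. Returns (clipped_grads, original_norm)."""
    norm = global_grad_norm(grads)
    scale = min(1.0, max_norm / (norm + 1e-6))  # small epsilon avoids division by zero
    return [[g * scale for g in param] for param in grads], norm

grads = [[3.0, 4.0], [0.0, 12.0]]  # two toy "parameters"
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
print(norm)                       # norm before clipping
print(global_grad_norm(clipped))  # norm after clipping
```

Because every gradient is scaled by the same factor, clipping changes the size of the update but not its direction.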
Detecting and Fixing Overfitting
Overfitting is the most common problem in fine-tuning, especially with small datasets.
Signs of Overfitting
- Validation loss increases while training loss decreases
- Model outputs become near-exact copies of training examples
- Model quality degrades on out-of-distribution inputs
- Training loss reaches very low values (below 0.1) unusually quickly
Remedies
- Reduce epochs. If overfitting starts at epoch 2, train for 1.5 epochs.
- Increase dropout. Raise lora_dropout from 0.05 to 0.1.
- Reduce rank. A lower LoRA rank (r=8 instead of r=16) reduces model capacity.
- Add more data. The best solution if possible.
- Use early stopping. Stop training when validation loss has not improved for N consecutive evaluations.
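from transformers import EarlyStoppingCallback
from trl import SFTTrainer

# Early stopping needs evaluation enabled and metric_for_best_model set in the training arguments
trainer = SFTTrainer(
    # ... other arguments
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
)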
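The patience logic behind early stopping is simple enough to sketch in a few lines. The function below illustrates the idea on a list of validation losses; it is not the callback's actual implementation, and `early_stop_index` and `min_delta` are hypothetical names.

```python
def early_stop_index(eval_losses, patience=5, min_delta=0.0):
    """Return the index of the evaluation at which training would stop,
    or None if patience is never exhausted."""
    best = float("inf")
    bad_evals = 0
    for i, loss in enumerate(eval_losses):
        if loss < best - min_delta:
            best = loss        # new best: reset the counter
            bad_evals = 0
        else:
            bad_evals += 1     # no improvement at this evaluation
            if bad_evals >= patience:
                return i
    return None

# Toy values (made up): the best loss is at index 2
losses = [1.9, 1.6, 1.5, 1.52, 1.55, 1.6, 1.7]
print(early_stop_index(losses, patience=3))
```

A larger patience tolerates noisier validation curves at the cost of a few extra evaluations after the best checkpoint.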
Debugging Common Issues
NaN Loss
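NaN (Not a Number) loss means a numerical overflow or invalid operation has occurred somewhere in the forward or backward pass. This typically kills the training run, because every update computed from a NaN loss is also NaN.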
Common causes and fixes:
- Learning rate too high. Reduce from 2e-4 to 5e-5.
- Mixed precision issues. Switch from fp16 to bf16 (if your GPU supports it). bf16 has a larger dynamic range and is less prone to overflow.
- Data issues. Check for examples with extremely long text or unusual characters that produce very high loss values.
- Gradient explosion. Reduce max_grad_norm from 1.0 to 0.3.
# Debug NaN loss step by step
training_args = TrainingArguments(
    learning_rate=5e-5,  # Lower learning rate
    bf16=True,           # Use bfloat16 instead of fp16
    max_grad_norm=0.3,   # Aggressive gradient clipping
    logging_steps=1,     # Log every step to find where NaN occurs
)
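With per-step logging in place, you can search the logs for the first step where values stop being finite. The sketch below assumes log entries shaped like the dictionaries in `trainer.state.log_history`, with `step`, `loss`, and (depending on the library version) `grad_norm` keys; `first_bad_step` is a hypothetical helper and the sample values are made up.

```python
import math

def first_bad_step(log_history):
    """Return the `step` of the first entry whose loss or gradient norm
    is NaN or infinite, or None if all logged values are finite."""
    for entry in log_history:
        for key in ("loss", "grad_norm"):
            value = entry.get(key)
            if value is not None and not math.isfinite(value):
                return entry.get("step")
    return None

# Toy log entries (made-up values)
logs = [
    {"step": 1, "loss": 2.31, "grad_norm": 1.2},
    {"step": 2, "loss": 2.28, "grad_norm": 48.0},
    {"step": 3, "loss": float("nan"), "grad_norm": float("inf")},
]
print(first_bad_step(logs))
```

The entries just before the reported step are usually the most informative: a sharply rising gradient norm points to instability, while a clean run up to that step points to a bad batch.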
Gradient Explosion
Symptoms: Loss spikes suddenly, gradient norms become very large (>100), training may recover or collapse.
Fixes:
- Lower max_grad_norm (try 0.3 or even 0.1)
- Reduce learning rate
- Increase warmup steps to let the model adjust gradually
Data Formatting Errors
The most insidious bugs. The model trains without errors, but produces garbage because the conversation template was wrong.
How to verify:
# Always inspect formatted examples before training
sample = dataset["train"][0]
formatted = tokenizer.apply_chat_template(
    sample["messages"],
    tokenize=False,
)
print("=== Formatted example ===")
print(formatted)
print("=== Token IDs ===")
tokens = tokenizer.encode(formatted)
print(tokens[:50])  # First 50 tokens
print(f"Total tokens: {len(tokens)}")
Check that:
- Special tokens (BOS, EOS, role markers) are present and correct
- The model's response is not being masked during training (the model should learn from assistant responses)
- Padding tokens are not mixed into the content
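The masking check in the list above can be made concrete by comparing `input_ids` with `labels` for one training example. This sketch assumes the usual convention that masked positions carry the label -100, which the cross-entropy loss in PyTorch ignores by default; the helper names are hypothetical and the token IDs are made up.

```python
IGNORE_INDEX = -100  # label value skipped by the loss (common convention)

def supervised_fraction(labels):
    """Fraction of positions that contribute to the loss."""
    if not labels:
        return 0.0
    kept = sum(1 for label in labels if label != IGNORE_INDEX)
    return kept / len(labels)

def supervised_token_ids(input_ids, labels):
    """Token IDs the model is actually trained to predict."""
    return [tok for tok, label in zip(input_ids, labels) if label != IGNORE_INDEX]

# Toy example: the first three tokens are the prompt (masked), the rest is the reply
input_ids = [1, 15, 27, 42, 43, 2]
labels = [-100, -100, -100, 42, 43, 2]
print(supervised_fraction(labels))
print(supervised_token_ids(input_ids, labels))
```

Decoding the supervised token IDs with your tokenizer should give back the assistant response. A fraction of 0.0 means the model is learning from nothing, and a decoded span that contains the user prompt means the masking is not doing what you intended.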
Model Produces Repetitive Output
The model generates the same phrase over and over, or enters a loop.
Causes:
- Overfitting on repetitive training data
- Training loss went too low (over-optimization)
- Incorrect generation parameters (greedy decoding, such as temperature=0, with no repetition penalty)
Fixes:
- Check training data for duplicates
- Reduce training epochs
- Use repetition_penalty=1.1 during generation
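Two of the checks above are easy to automate: finding duplicates in the training data and measuring how repetitive a generated output is. The following is a rough sketch with hypothetical helper names; the normalization (lowercasing and collapsing whitespace) and the choice of word 3-grams are assumptions, not a standard.

```python
from collections import Counter

def find_duplicates(texts):
    """Return texts that appear more than once after whitespace and
    case normalization, with their counts."""
    normalized = (" ".join(t.lower().split()) for t in texts)
    counts = Counter(normalized)
    return {text: n for text, n in counts.items() if n > 1}

def repeated_ngram_ratio(text, n=3):
    """Share of word n-grams that repeat an earlier n-gram.
    0.0 means no repetition; higher values indicate looping output."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

print(find_duplicates(["Hello  world", "hello world", "Goodbye"]))
print(repeated_ngram_ratio("the cat sat the cat sat the cat sat"))
```

Tracking the repetition ratio across checkpoints can show whether looping appears gradually with more training or is present from the start, which would point to generation settings instead.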
Setting Up TensorBoard
TensorBoard is the simplest monitoring solution: the Hugging Face Trainer supports it out of the box once the tensorboard package is installed.
training_args = TrainingArguments(
    report_to="tensorboard",
    logging_dir="./logs",
    logging_steps=10,
)
Launch TensorBoard:
tensorboard --logdir ./logs
This opens a dashboard at http://localhost:6006 where you can see loss curves, learning rate, gradient norms, and more in real time.
Setting Up Weights and Biases
W&B provides more features: experiment comparison, hyperparameter sweeps, team collaboration, and model artifact tracking.
pip install wandb
wandb login # Enter your API key
import wandb

wandb.init(
    project="llm-fine-tuning",
    name="llama3-8b-legal-v1",
    config={
        "model": "Llama-3.1-8B",
        "rank": 16,
        "alpha": 32,
        "learning_rate": 2e-4,
        "dataset_size": len(dataset["train"]),
    },
)

training_args = TrainingArguments(
    report_to="wandb",
    logging_steps=10,
)
W&B automatically logs training metrics and system metrics (GPU memory, utilization), and lets you add custom logging:
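from transformers import TrainerCallback

# Log sample predictions during training
class PredictionCallback(TrainerCallback):
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def on_evaluate(self, args, state, control, model=None, **kwargs):
        model.eval()  # Inference mode (disables dropout) while generating
        table = wandb.Table(columns=["step", "prompt", "output"])
        for prompt in ["Explain LoRA", "What is QLoRA"]:
            # generate() stands in for your own helper that returns decoded text
            output = generate(model, self.tokenizer, prompt)
            table.add_data(state.global_step, prompt, output)
        wandb.log({"predictions": table})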
Checkpointing Strategy
Save checkpoints wisely — they are your insurance policy:
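training_args = TrainingArguments(
    eval_strategy="steps",        # load_best_model_at_end needs evaluation enabled
    eval_steps=100,               # Evaluate at the same steps that are saved
    save_strategy="steps",
    save_steps=100,               # Save every 100 steps
    save_total_limit=3,           # Keep only the 3 most recent
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)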
Tips:
- For short runs (under 500 steps), save every 50 steps
- For long runs (over 2000 steps), save every 200 steps
- Always keep at least the best checkpoint based on eval loss
- save_total_limit prevents filling up disk space with large checkpoints
Practical Debugging Checklist
When a training run produces poor results, work through this checklist:
- Inspect raw data. Read 10 random training examples. Are they correct?
- Check formatting. Print a formatted example with all special tokens visible. Do they match the model's expected template?
- Verify tokenization. Are examples being truncated? What is the token length distribution?
- Review loss curves. Is training loss decreasing? Is eval loss diverging?
- Test manually. Generate outputs from the model at different checkpoints. Does quality improve?
- Compare to base model. Is the fine-tuned model actually better than the base model on your task?
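For the tokenization step of the checklist, a short summary of token lengths shows quickly whether truncation is a problem. This is a minimal sketch: `length_report` is a hypothetical helper, `max_seq_length` stands for whatever maximum length your training setup uses, and the lengths themselves would come from tokenizing your own examples (the values below are made up).

```python
def length_report(token_lengths, max_seq_length):
    """Summarize token lengths and the share of examples that would be
    truncated at `max_seq_length`. Expects a non-empty list."""
    ordered = sorted(token_lengths)
    n = len(ordered)

    def percentile(p):
        # Nearest-rank style percentile; adequate for a quick sanity check
        index = min(n - 1, max(0, round(p / 100 * (n - 1))))
        return ordered[index]

    truncated = sum(1 for length in ordered if length > max_seq_length)
    return {
        "min": ordered[0],
        "median": percentile(50),
        "p95": percentile(95),
        "max": ordered[-1],
        "truncated_fraction": truncated / n,
    }

lengths = [120, 340, 95, 2100, 512, 780, 60, 1500, 410, 230]
print(length_report(lengths, max_seq_length=1024))
```

If a large share of examples is truncated, the model may never see the end of the assistant responses, which is worth fixing before changing any hyperparameters.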
In the next lesson, we tackle the often-overlooked topic of evaluation — how to systematically measure whether your fine-tuned model actually improved.