Monitoring and Debugging Training — Fine-Tuning LLMs: From Data to Deployment

Why Monitoring Matters

A fine-tuning run can take anywhere from 30 minutes to several hours. Without proper monitoring, you may only discover problems after the run completes — wasting compute and time. Good monitoring lets you catch overfitting, detect data issues, and kill bad runs early. This lesson covers the metrics you should track, the patterns you should recognize, and the tools that make it all visible.

Essential Training Metrics

Training Loss

The primary metric. It measures how well the model predicts the next token on the training data. A healthy training loss curve shows a steep initial drop followed by a gradual decrease.

What to watch for:

Steady decrease: Good. The model is learning.
Plateau early: The learning rate may be too low, or the LoRA rank may be too small for the complexity of the task.
Oscillation: The learning rate is too high or the batch size is too small.
Sudden spike: Possible data corruption at that batch, or a numerical instability.
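
The patterns above can also be checked programmatically. The following is a minimal sketch, not a standard recipe: the function names, the window size, and the thresholds are illustrative assumptions that you would tune for your own runs.

```python
from statistics import mean

def find_loss_spikes(losses, window=5, factor=2.0):
    """Return indices where the loss exceeds `factor` times the mean
    of the previous `window` values (hypothetical thresholds)."""
    spikes = []
    for i in range(window, len(losses)):
        baseline = mean(losses[i - window:i])
        if losses[i] > factor * baseline:
            spikes.append(i)
    return spikes

def has_plateaued(losses, window=5, min_improvement=0.01):
    """True if the mean of the last `window` losses improved by less than
    `min_improvement` (relative) compared with the `window` before it."""
    if len(losses) < 2 * window:
        return False
    previous = mean(losses[-2 * window:-window])
    recent = mean(losses[-window:])
    return (previous - recent) / previous < min_improvement

# Made-up loss values with one obvious spike at index 6
history = [2.5, 2.1, 1.8, 1.6, 1.5, 1.45, 4.0, 1.4, 1.38, 1.37]
print("Spike indices:", find_loss_spikes(history))
print("Plateaued:", has_plateaued(history))
```

A spike index tells you which logging step to inspect; mapping it back to the batch that caused it depends on your data loader and is not covered here.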

Validation Loss

Measures performance on data the model has not seen during training. The gap between training loss and validation loss is your overfitting indicator.

Healthy pattern: Training loss and validation loss decrease together, with validation loss slightly higher.

Overfitting pattern: Training loss keeps decreasing while validation loss starts increasing. This means the model is memorizing training examples rather than learning generalizable patterns.

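# In TrainingArguments, enable evaluation
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./outputs",     # Placeholder path; checkpoints are written here
    eval_strategy="steps",      # Named evaluation_strategy in older transformers releases
    eval_steps=50,              # Evaluate every 50 steps
    save_strategy="steps",      # load_best_model_at_end needs save and eval strategies to match
    save_steps=50,              # Must be a multiple of eval_steps
    load_best_model_at_end=True,  # Reload the best checkpoint
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)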
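
Once evaluation is enabled, the overfitting pattern can be looked for in the logged history. The sketch below assumes entries shaped like those in `trainer.state.log_history` (dictionaries with a `step` key and, for evaluation entries, an `eval_loss` key). The function name and the `patience` parameter are illustrative, and the sample log is made up.

```python
def overfitting_onset(log_history, patience=2):
    """Return the step of the last evaluation before eval_loss rose
    `patience` times in a row, or None if that never happens."""
    evals = [(e["step"], e["eval_loss"]) for e in log_history if "eval_loss" in e]
    for i in range(len(evals) - patience):
        window = [loss for _, loss in evals[i:i + patience + 1]]
        # Strictly increasing eval loss across the whole window
        if all(later > earlier for earlier, later in zip(window, window[1:])):
            return evals[i][0]
    return None

# Made-up history: training loss keeps falling, eval loss turns upward after step 150
sample_log = [
    {"step": 50, "loss": 1.60}, {"step": 50, "eval_loss": 1.50},
    {"step": 100, "loss": 1.20}, {"step": 100, "eval_loss": 1.30},
    {"step": 150, "loss": 0.90}, {"step": 150, "eval_loss": 1.25},
    {"step": 200, "loss": 0.60}, {"step": 200, "eval_loss": 1.30},
    {"step": 250, "loss": 0.40}, {"step": 250, "eval_loss": 1.40},
]
print("Overfitting appears to start after step:", overfitting_onset(sample_log))
```

A single noisy evaluation can rise without indicating overfitting, which is why the sketch waits for several consecutive increases before reporting anything.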

Learning Rate Schedule

Track the learning rate to verify your scheduler is working correctly. A cosine schedule should show a smooth curve from peak to near-zero. A linear schedule should show a straight decline.
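
To know what the logged curve should look like, it can help to compute the expected values yourself. This is a simplified sketch of cosine and linear decay with optional linear warmup; the function name and arguments are assumptions for illustration, and a real scheduler may differ in details such as the minimum learning rate.

```python
import math

def lr_at_step(step, total_steps, peak_lr, warmup_steps=0, schedule="cosine"):
    """Expected learning rate at `step` for a warmup-then-decay schedule."""
    if step < warmup_steps:
        # Linear warmup from 0 up to the peak learning rate
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if schedule == "cosine":
        # Smooth half-cosine from peak_lr down to 0
        return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
    if schedule == "linear":
        # Straight line from peak_lr down to 0
        return peak_lr * max(0.0, 1.0 - progress)
    raise ValueError(f"Unknown schedule: {schedule}")

for step in (0, 25, 50, 75, 100):
    cosine = lr_at_step(step, 100, 2e-4, schedule="cosine")
    linear = lr_at_step(step, 100, 2e-4, schedule="linear")
    print(f"step {step:3d}  cosine={cosine:.2e}  linear={linear:.2e}")
```

If the learning rate in your dashboard stays flat, or never reaches the peak you configured, compare it against values like these before looking elsewhere.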

Gradient Norm

The magnitude of gradients tells you about training stability. Gradient norms that spike to very high values indicate instability. Gradient norms that collapse to near zero indicate vanishing gradients.

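# Recent transformers versions report grad_norm alongside the loss at every
# logging step, so logging frequency controls how often you see it
training_args = TrainingArguments(
    logging_steps=10,
    logging_first_step=True,
    max_grad_norm=1.0,  # Clip gradients above this norm
)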
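
To make the effect of max_grad_norm concrete, here is a plain-Python sketch of clipping by global norm. It treats all gradients as one flat vector and rescales them when their combined L2 norm exceeds the limit, which is the idea behind gradient norm clipping. The helper names and the toy gradient values are illustrative assumptions; real training code operates on tensors.

```python
import math

def global_grad_norm(grads):
    """L2 norm over all gradient values, treating every parameter as one flat vector."""
    return math.sqrt(sum(g * g for param in grads for g in param))

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients by the same factor so the global norm is at most max_norm.
    Returns the clipped gradients and the norm measured before clipping."""
    norm = global_grad_norm(grads)
    scale = min(1.0, max_norm / (norm + 1e-6))  # Never scale gradients up
    clipped = [[g * scale for g in param] for param in grads]
    return clipped, norm

# Two toy "parameters" whose combined norm is 5.0
grads = [[3.0, 0.0], [0.0, 4.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
print("Norm before clipping:", norm_before)
print("Norm after clipping:", round(global_grad_norm(clipped), 4))
```

The norm measured before clipping is the useful diagnostic: clipping hides spikes from the optimizer, but the spikes still indicate that something is unstable.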

Detecting and Fixing Overfitting

Overfitting is the most common problem in fine-tuning, especially with small datasets.

Signs of Overfitting

Validation loss increases while training loss decreases
Model outputs become near-exact copies of training examples
Model quality degrades on out-of-distribution inputs
Training loss reaches very low values (below 0.1) unusually quickly

Remedies

Reduce epochs. If overfitting starts at epoch 2, train for 1.5 epochs.
Increase dropout. Raise lora_dropout from 0.05 to 0.1.
Reduce rank. A lower LoRA rank (r=8 instead of r=16) reduces model capacity.
Add more data. The best solution if possible.
Use early stopping. Stop training when validation loss has not improved for N evaluation steps.

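from transformers import EarlyStoppingCallback
from trl import SFTTrainer

# Early stopping relies on periodic evaluation: set eval_strategy and
# metric_for_best_model in the training arguments, and pair the callback
# with load_best_model_at_end=True so the best checkpoint is restored.
trainer = SFTTrainer(
    # ... other arguments
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
)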
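
The patience rule itself is simple enough to sketch in a few lines. The function below is an illustration of the idea, not the library's implementation: it counts evaluations without improvement and reports where training would stop. The name `early_stop_index` and the `min_delta` parameter are assumptions.

```python
def early_stop_index(eval_losses, patience=5, min_delta=0.0):
    """Return the index of the evaluation at which training would stop,
    or None if patience is never exhausted."""
    best = float("inf")
    evals_without_improvement = 0
    for i, loss in enumerate(eval_losses):
        if loss < best - min_delta:
            best = loss
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                return i
    return None

# Made-up eval losses: the best value is at index 2, then no further improvement
losses = [1.0, 0.9, 0.8, 0.85, 0.82, 0.81]
print("Stops at evaluation:", early_stop_index(losses, patience=3))
```

With eval_steps=50 and a patience of 5, the run continues for up to 250 steps past the best checkpoint, so choose patience with the evaluation interval in mind.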

Debugging Common Issues

NaN Loss

NaN (Not a Number) loss means a numerical error has occurred, most often an overflow. Once the loss becomes NaN the run rarely recovers, so stop it and fix the cause rather than waiting.

Common causes and fixes:

Learning rate too high. Reduce from 2e-4 to 5e-5.
Mixed precision issues. Switch from fp16 to bf16 (if your GPU supports it). bf16 has a larger dynamic range and is less prone to overflow.
Data issues. Check for examples with extremely long text or unusual characters that produce very high loss values.
Gradient explosion. Reduce max_grad_norm from 1.0 to 0.3.

# Debug NaN loss step by step
training_args = TrainingArguments(
    learning_rate=5e-5,        # Lower learning rate
    bf16=True,                 # Use bfloat16 instead of fp16
    max_grad_norm=0.3,         # Aggressive gradient clipping
    logging_steps=1,           # Log every step to find where NaN occurs
)
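
With per-step logging enabled, the first bad step can be located by scanning the log. This sketch assumes entries shaped like those in `trainer.state.log_history`, with `step`, `loss`, and `grad_norm` keys; the function name and the sample values are made up for illustration.

```python
import math

def first_non_finite(log_history, keys=("loss", "grad_norm")):
    """Return (step, key) for the first logged value that is NaN or infinite,
    or None if every logged value is finite."""
    for entry in log_history:
        for key in keys:
            value = entry.get(key)
            if value is not None and not math.isfinite(value):
                return entry["step"], key
    return None

# Made-up log: the gradient norm blows up one step before the loss turns NaN
sample_log = [
    {"step": 1, "loss": 2.0, "grad_norm": 1.1},
    {"step": 2, "loss": 2.1, "grad_norm": float("inf")},
    {"step": 3, "loss": float("nan"), "grad_norm": float("nan")},
]
print("First non-finite value:", first_non_finite(sample_log))
```

The gradient norm often becomes non-finite before the loss does, so the step reported here is usually the one whose batch deserves inspection.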

Gradient Explosion

Symptoms: Loss spikes suddenly, gradient norms become very large (>100), training may recover or collapse.

Fixes:

Lower max_grad_norm (try 0.3 or even 0.1)
Reduce learning rate
Increase warmup steps to let the model adjust gradually

Data Formatting Errors

The most insidious bugs. The model trains without errors, but produces garbage because the conversation template was wrong.

How to verify:

# Always inspect formatted examples before training
sample = dataset["train"][0]
formatted = tokenizer.apply_chat_template(
    sample["messages"],
    tokenize=False,
)
print("=== Formatted example ===")
print(formatted)
print("=== Token IDs ===")
# The chat template already inserts special tokens, so do not add them a second time
tokens = tokenizer.encode(formatted, add_special_tokens=False)
print(tokens[:50])  # First 50 tokens
print(f"Total tokens: {len(tokens)}")

Check that:

Special tokens (BOS, EOS, role markers) are present and correct
The model's response is not being masked during training (the model should learn from assistant responses)
Padding tokens are not mixed into the content
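
The masking check can be made concrete by counting which positions contribute to the loss. The sketch below assumes the common convention that label positions set to -100 are ignored by the loss; the function name and the toy token IDs are illustrative assumptions.

```python
IGNORE_INDEX = -100  # Label value conventionally skipped by the loss

def summarize_labels(input_ids, labels, pad_token_id):
    """Count how many tokens are trained on and whether padding leaks into the loss."""
    if len(input_ids) != len(labels):
        raise ValueError("input_ids and labels must have the same length")
    trained = [tok for tok, lab in zip(input_ids, labels) if lab != IGNORE_INDEX]
    return {
        "total_tokens": len(input_ids),
        "trained_tokens": len(trained),
        "trained_fraction": len(trained) / len(input_ids) if input_ids else 0.0,
        "pad_tokens_in_loss": sum(1 for tok in trained if tok == pad_token_id),
    }

# Toy example: 4 prompt tokens (masked), 3 response tokens (trained), 2 padding tokens (masked)
input_ids = [1, 10, 11, 12, 20, 21, 2, 0, 0]
labels = [-100, -100, -100, -100, 20, 21, 2, -100, -100]
print(summarize_labels(input_ids, labels, pad_token_id=0))
```

A trained fraction of zero means the assistant response is being masked. Note that if the padding token is the same as the EOS token, a legitimate EOS at the end of the response will be counted in pad_tokens_in_loss.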

Model Produces Repetitive Output

The model generates the same phrase over and over, or enters a loop.

Causes:

Overfitting on repetitive training data
Training loss went too low (over-optimization)
Incorrect generation parameters (greedy decoding, or a temperature near 0, with no repetition penalty)

Fixes:

Check training data for duplicates
Reduce training epochs
Use repetition_penalty=1.1 during generation
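
Checking for duplicates is easy to automate. This is a minimal sketch that only catches exact duplicates after lowercasing and whitespace normalization; near-duplicates need fuzzier matching, which is not shown. The function name and the sample texts are illustrative.

```python
from collections import Counter

def find_duplicates(texts):
    """Return a dict mapping each normalized text that occurs more than once to its count."""
    normalized = [" ".join(text.lower().split()) for text in texts]
    counts = Counter(normalized)
    return {text: n for text, n in counts.items() if n > 1}

examples = [
    "Explain LoRA.",
    "explain   lora.",   # Same text with different casing and spacing
    "What is QLoRA?",
]
duplicates = find_duplicates(examples)
print(f"{len(duplicates)} duplicated text(s):", duplicates)
```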

Setting Up TensorBoard

TensorBoard is the simplest monitoring solution. The Hugging Face Trainer supports it out of the box once the tensorboard package is installed:

training_args = TrainingArguments(
    report_to="tensorboard",
    logging_dir="./logs",
    logging_steps=10,
)

Launch TensorBoard:

tensorboard --logdir ./logs

This serves a dashboard at http://localhost:6006 (the default port) where you can see loss curves, learning rate, gradient norms, and more in real time.

Setting Up Weights and Biases

W&B provides more features: experiment comparison, hyperparameter sweeps, team collaboration, and model artifact tracking.

pip install wandb
wandb login  # Enter your API key

import wandb

wandb.init(
    project="llm-fine-tuning",
    name="llama3-8b-legal-v1",
    config={
        "model": "Llama-3.1-8B",
        "rank": 16,
        "alpha": 32,
        "learning_rate": 2e-4,
        "dataset_size": len(dataset["train"]),
    }
)

training_args = TrainingArguments(
    report_to="wandb",
    logging_steps=10,
)

W&B automatically logs training metrics and system metrics (GPU memory, utilization), and lets you add custom logging:

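import wandb
from transformers import TrainerCallback

# Log sample predictions during training
# Usage: SFTTrainer(..., callbacks=[PredictionCallback(tokenizer)])
class PredictionCallback(TrainerCallback):
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.test_prompts = ["Explain LoRA", "What is QLoRA"]

    def on_evaluate(self, args, state, control, model=None, **kwargs):
        model.eval()  # Put the model in inference mode for generation
        rows = []
        for prompt in self.test_prompts:
            inputs = self.tokenizer(prompt, return_tensors="pt").to(model.device)
            output_ids = model.generate(**inputs, max_new_tokens=64)  # 64 is an arbitrary cap
            rows.append([state.global_step, prompt, self.tokenizer.decode(output_ids[0], skip_special_tokens=True)])
        wandb.log({"predictions": wandb.Table(columns=["step", "prompt", "output"], data=rows)})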

Checkpointing Strategy

Save checkpoints wisely — they are your insurance policy:

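training_args = TrainingArguments(
    eval_strategy="steps",    # load_best_model_at_end needs evaluation on the same schedule
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,           # Save every 100 steps
    save_total_limit=3,       # Keep at most 3 checkpoints; the best one is always retained
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)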

Tips:

For short runs (under 500 steps), save every 50 steps
For long runs (over 2000 steps), save every 200 steps
Always keep at least the best checkpoint based on eval loss
save_total_limit prevents filling up disk space with large checkpoints
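
The tips above can be folded into a small helper. This is only a sketch: the two thresholds come from the tips, while the value of 100 for mid-length runs is an assumption borrowed from the example configuration. The compatibility check reflects the requirement that, with load_best_model_at_end, the save interval be a multiple of the evaluation interval.

```python
def suggest_save_steps(total_steps):
    """Pick a checkpoint interval from the total number of training steps."""
    if total_steps < 500:
        return 50    # Short runs: save every 50 steps
    if total_steps > 2000:
        return 200   # Long runs: save every 200 steps
    return 100       # Assumption for runs in between, matching the example above

def is_compatible(save_steps, eval_steps):
    """True if save_steps is a whole multiple of eval_steps."""
    return eval_steps > 0 and save_steps % eval_steps == 0

for total in (300, 1000, 5000):
    save_steps = suggest_save_steps(total)
    print(f"{total} steps -> save every {save_steps}, "
          f"compatible with eval_steps=50: {is_compatible(save_steps, 50)}")
```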

Practical Debugging Checklist

When a training run produces poor results, work through this checklist:

Inspect raw data. Read 10 random training examples. Are they correct?
Check formatting. Print a formatted example with all special tokens visible. Do they match the model's expected template?
Verify tokenization. Are examples being truncated? What is the token length distribution?
Review loss curves. Is training loss decreasing? Is eval loss diverging?
Test manually. Generate outputs from the model at different checkpoints. Does quality improve?
Compare to base model. Is the fine-tuned model actually better than the base model on your task?
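
For the tokenization step of the checklist, a quick length report shows whether truncation is affecting your data. The sketch below takes a list of per-example token counts, for example `len(tokenizer.encode(text))` for each example. The function name, the rough percentile method, and the sample lengths are illustrative assumptions.

```python
def length_report(token_lengths, max_seq_length):
    """Summarize token lengths and the share of examples that would be truncated."""
    if not token_lengths:
        raise ValueError("token_lengths is empty")
    ordered = sorted(token_lengths)

    def percentile(p):
        # Rough percentile: pick the element at the p% position of the sorted list
        index = min(len(ordered) - 1, int(p / 100 * len(ordered)))
        return ordered[index]

    truncated = sum(1 for n in ordered if n > max_seq_length)
    return {
        "p50": percentile(50),
        "p95": percentile(95),
        "max": ordered[-1],
        "truncated_fraction": truncated / len(ordered),
    }

# Made-up token counts with one very long outlier
lengths = [100, 200, 300, 400, 500, 600, 700, 800, 900, 3000]
print(length_report(lengths, max_seq_length=1024))
```

If a large share of examples is truncated, the model may never see the end of the assistant response, which can look like a formatting bug in the outputs.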

In the next lesson, we tackle the often-overlooked topic of evaluation — how to systematically measure whether your fine-tuned model actually improved.