Evaluation and Benchmarks
The Evaluation Gap
You trained a model, the loss went down, and it generates text that looks reasonable. But is it actually better than the base model for your use case? Without rigorous evaluation, you are guessing. Most fine-tuning projects fail not because the training was bad, but because the evaluation was insufficient — or nonexistent. This lesson gives you a complete evaluation toolkit.
Task-Specific Automated Metrics
Different tasks demand different metrics. Choose the ones that match your use case.
Exact Match
For tasks with a single correct answer (classification, entity extraction, closed-ended QA):
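def exact_match(predictions, references):
    """Fraction of predictions equal to their reference after trimming whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = sum(1 for p, r in zip(predictions, references) if p.strip() == r.strip())
    return correct / len(predictions)

# Example
predictions = ["positive", "negative", "neutral"]
references = ["positive", "negative", "positive"]
print(f"Exact Match: {exact_match(predictions, references):.2%}")  # 66.67%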
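Strict string equality also penalizes differences in case or trailing punctuation that often do not matter for the task. The sketch below shows one possible relaxation; the `normalize` helper and its rules (lowercasing, dropping ASCII punctuation, collapsing whitespace) are assumptions for illustration, so adjust them to whatever your task treats as equivalent.

```python
import string

def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def normalized_exact_match(predictions, references):
    """Exact match computed on normalized strings instead of raw strings."""
    matches = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return matches / len(predictions)

preds = ["Positive.", " negative", "Neutral"]
refs = ["positive", "negative", "positive"]
print(f"Normalized Exact Match: {normalized_exact_match(preds, refs):.2%}")
```

Reporting the strict and the normalized score side by side makes it easier to tell formatting errors apart from wrong answers.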
F1 Score
For classification tasks, especially with imbalanced classes:
from sklearn.metrics import classification_report
predictions = ["positive", "negative", "neutral", "positive", "negative"]
references = ["positive", "positive", "neutral", "positive", "negative"]
print(classification_report(references, predictions))
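To see what the per-class numbers in such a report mean, it can help to compute them by hand. The following is a minimal pure-Python sketch of per-class F1 and its unweighted (macro) average; the function name `per_class_f1` is hypothetical, and the data is the same as in the example above.

```python
def per_class_f1(references, predictions):
    """Return {label: F1} computed from true/false positives and false negatives."""
    labels = sorted(set(references) | set(predictions))
    pairs = list(zip(references, predictions))
    scores = {}
    for label in labels:
        tp = sum(1 for r, p in pairs if r == label and p == label)
        fp = sum(1 for r, p in pairs if r != label and p == label)
        fn = sum(1 for r, p in pairs if r == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores[label] = 2 * precision * recall / denom if denom else 0.0
    return scores

predictions = ["positive", "negative", "neutral", "positive", "negative"]
references = ["positive", "positive", "neutral", "positive", "negative"]

scores = per_class_f1(references, predictions)
macro_f1 = sum(scores.values()) / len(scores)
for label, f1 in scores.items():
    print(f"{label}: F1={f1:.3f}")
print(f"macro F1: {macro_f1:.3f}")  # 0.822
```

Macro averaging weights every class equally, which is why it is more informative than accuracy when one class dominates the data.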
BLEU Score
For translation and text generation tasks where you have reference outputs:
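from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# sentence_bleu expects a list of tokenized references and one tokenized candidate
reference = [["the", "cat", "sat", "on", "the", "mat"]]
candidate = ["the", "cat", "is", "on", "the", "mat"]
# This pair shares no 4-grams, so the unsmoothed score collapses to near zero;
# smoothing keeps short-sentence scores meaningful
score = sentence_bleu(reference, candidate, smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.4f}")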
Caution: BLEU measures n-gram overlap with a reference. It is useful for translation but misleading for open-ended generation where multiple valid outputs exist.
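The effect described in the caution can be illustrated with clipped n-gram precision, the building block of BLEU. This is a simplified sketch, not the full BLEU formula (it omits the brevity penalty and the geometric mean over n-gram orders), and the example sentences are made up for illustration.

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of a tokenized candidate against one reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total

reference = "the cat sat on the mat".split()
close_copy = "the cat is on the mat".split()
paraphrase = "a feline rested upon the rug".split()

for name, cand in [("close copy", close_copy), ("paraphrase", paraphrase)]:
    p1 = ngram_precision(cand, reference, 1)
    p2 = ngram_precision(cand, reference, 2)
    print(f"{name}: unigram={p1:.2f}, bigram={p2:.2f}")
```

The paraphrase conveys roughly the same meaning as the reference but shares almost no n-grams with it, so an overlap metric scores it far below the near copy.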
ROUGE Score
For summarization tasks, measuring overlap between generated and reference summaries:
from rouge_score import rouge_scorer
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "The model was fine-tuned on legal documents for contract analysis."
generated = "Fine-tuning was performed on legal documents to analyze contracts."
scores = scorer.score(reference, generated)
for metric, score in scores.items():
    print(f"{metric}: Precision={score.precision:.3f}, Recall={score.recall:.3f}, F1={score.fmeasure:.3f}")
Human Evaluation
Automated metrics cannot capture everything. For subjective quality (helpfulness, coherence, safety), human evaluation is essential.
Blind Comparison Protocol
The most reliable human evaluation method: show evaluators outputs from the base model and fine-tuned model side-by-side, without revealing which is which.
import random
import json

def create_blind_evaluation_set(test_prompts, base_outputs, finetuned_outputs):
    """Create a randomized blind evaluation set."""
    evaluation_pairs = []
    for i, prompt in enumerate(test_prompts):
        # Randomly assign A/B positions
        if random.random() > 0.5:
            pair = {
                "id": i,
                "prompt": prompt,
                "response_a": base_outputs[i],
                "response_b": finetuned_outputs[i],
                "mapping": {"a": "base", "b": "finetuned"}
            }
        else:
            pair = {
                "id": i,
                "prompt": prompt,
                "response_a": finetuned_outputs[i],
                "response_b": base_outputs[i],
                "mapping": {"a": "finetuned", "b": "base"}
            }
        evaluation_pairs.append(pair)
    return evaluation_pairs
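Once evaluators have picked a side for each pair, the stored `mapping` is what lets you unblind the results. The sketch below shows one way to do that; the `votes` format (pair id mapped to "a", "b", or "tie") and the function name `tally_blind_votes` are assumptions for illustration, not part of the protocol above.

```python
from collections import Counter

def tally_blind_votes(evaluation_pairs, votes):
    """Unblind evaluator choices and return the share of votes per outcome.

    votes is a hypothetical format: {pair_id: "a" | "b" | "tie"}.
    """
    counts = Counter()
    for pair in evaluation_pairs:
        choice = votes.get(pair["id"])
        if choice is None:
            continue  # this pair was not rated
        if choice == "tie":
            counts["tie"] += 1
        else:
            # The mapping records which model sat behind label "a" or "b"
            counts[pair["mapping"][choice]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {outcome: counts[outcome] / total for outcome in ("base", "finetuned", "tie")}

# Pairs shaped like the output of create_blind_evaluation_set (prompts omitted)
pairs = [
    {"id": 0, "mapping": {"a": "base", "b": "finetuned"}},
    {"id": 1, "mapping": {"a": "finetuned", "b": "base"}},
    {"id": 2, "mapping": {"a": "finetuned", "b": "base"}},
    {"id": 3, "mapping": {"a": "base", "b": "finetuned"}},
]
votes = {0: "b", 1: "a", 2: "b", 3: "tie"}
rates = tally_blind_votes(pairs, votes)
print(rates)
```

Keep the mapping away from evaluators until all votes are in; only the analysis step should ever read it.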
Evaluation Rubric
Define specific criteria for evaluators:
- Accuracy (1-5): Is the information correct?
- Relevance (1-5): Does the response address the prompt?
- Format compliance (1-5): Does the output follow the required format?
- Fluency (1-5): Is the language natural and well-written?
- Overall preference: Which response is better? (A / B / Tie)
Tip: Use at least 3 evaluators per example and measure inter-annotator agreement. If evaluators disagree significantly, your rubric needs clarification.
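One common way to quantify agreement among three or more evaluators on a categorical judgment (such as the overall preference) is Fleiss' kappa. The sketch below is a plain-Python version under the assumption that every example was rated by the same number of evaluators; the input layout (one list of labels per example) is chosen for illustration.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels (one per evaluator)."""
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number of ratings (at least 2)")
    n_items = len(ratings)
    category_totals = Counter()
    per_item_agreement = []
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of evaluator pairs that agree on this item
        agree = (sum(c * c for c in counts.values()) - n_raters) / (n_raters * (n_raters - 1))
        per_item_agreement.append(agree)
    observed = sum(per_item_agreement) / n_items
    # Agreement expected by chance, from the overall label frequencies
    expected = sum((t / (n_items * n_raters)) ** 2 for t in category_totals.values())
    if expected == 1.0:
        return 1.0  # every rating is the same label
    return (observed - expected) / (1 - expected)

# Overall-preference votes from 3 evaluators on 4 examples
ratings = [
    ["A", "A", "A"],
    ["A", "A", "B"],
    ["B", "B", "B"],
    ["A", "B", "Tie"],
]
print(f"Fleiss' kappa: {fleiss_kappa(ratings):.3f}")  # 0.268
```

Values near 1 indicate strong agreement, while values near 0 mean the evaluators agree about as often as chance would predict.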
LLM-as-Judge Evaluation
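Using a strong LLM (such as GPT-4o or Claude) to evaluate outputs is faster and cheaper than human evaluation, and its judgments often agree with human preferences. Spot-check the judge against a small set of human ratings on your own task before relying on it.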
Single-Output Scoring
import json
import openai

client = openai.OpenAI()

def llm_judge_score(prompt, response, criteria):
    """Score a model response on a 1-10 scale using an LLM judge."""
    judge_prompt = f"""You are an expert evaluator. Score the following AI response
on a scale of 1-10 based on these criteria:
{criteria}
User Prompt: {prompt}
AI Response: {response}
Provide your score as a JSON object with keys "score" (integer 1-10)
and "reasoning" (brief explanation).
"""
    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": judge_prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    # JSON mode returns the object as a string in the message content
    return json.loads(result.choices[0].message.content)

# Example usage
score = llm_judge_score(
    prompt="Explain the concept of LoRA in simple terms",
    response="LoRA is a way to teach an AI new tricks without retraining everything...",
    criteria="Accuracy, clarity, completeness, and appropriate simplification for a general audience."
)
print(f"Score: {score['score']}/10 - {score['reasoning']}")
Pairwise Comparison
More reliable than absolute scoring — ask the judge to compare two outputs:
def llm_judge_compare(prompt, response_a, response_b, criteria):
    """Compare two responses and select the better one."""
    judge_prompt = f"""You are an expert evaluator. Compare these two AI responses
and determine which is better based on: {criteria}
User Prompt: {prompt}
Response A: {response_a}
Response B: {response_b}
Respond with JSON: {{"winner": "A" or "B" or "tie", "reasoning": "brief explanation"}}
"""
    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": judge_prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(result.choices[0].message.content)
Important: LLM judges have known biases — they tend to prefer longer responses, responses with bullet points, and the first response in a pair (position bias). Mitigate position bias by running each comparison twice with swapped positions.
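The position-swap mitigation can be sketched as a small wrapper around any pairwise judge. Here `judge` is assumed to be a callable with the same signature and return shape as `llm_judge_compare`; the two stub judges exist only so the example runs without API calls, and treating an inconsistent verdict as a tie is one possible policy, not the only one.

```python
def compare_with_position_swap(judge, prompt, response_1, response_2, criteria):
    """Run a pairwise judge in both orders and keep only consistent verdicts."""
    first = judge(prompt, response_1, response_2, criteria)["winner"]
    second = judge(prompt, response_2, response_1, criteria)["winner"]
    # Translate the swapped verdict back to the original labelling
    second_unswapped = {"A": "B", "B": "A"}.get(second, "tie")
    if first == second_unswapped and first in ("A", "B"):
        return "response_1" if first == "A" else "response_2"
    return "tie"  # disagreement across orders is treated as no preference

def position_biased_judge(prompt, a, b, criteria):
    # Stub: always prefers whichever response is shown first
    return {"winner": "A"}

def length_judge(prompt, a, b, criteria):
    # Stub: prefers the longer response regardless of position
    return {"winner": "A" if len(a) > len(b) else "B"}

print(compare_with_position_swap(position_biased_judge, "q", "short", "much longer", "quality"))
print(compare_with_position_swap(length_judge, "q", "short", "much longer", "quality"))
```

A judge that only ever picks the first slot contradicts itself once the order is swapped, so its votes collapse to ties instead of inflating one model's win rate. Note that this wrapper does not address the length bias, which the second stub deliberately exhibits.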
Benchmark Contamination
A critical risk: if your training data contains examples from common benchmarks, your evaluation scores will be inflated.
How it happens:
- Synthetic data generated by GPT-4 may contain paraphrased benchmark questions
- Web-scraped data may include benchmark datasets
- Even indirect contamination (training on blog posts that discuss benchmark questions) can inflate scores
How to prevent it:
- Use a custom evaluation set that did not exist before your project
- Run n-gram overlap checks between your training data and evaluation set
- Report both standard benchmark scores and custom evaluation scores
def check_contamination(train_data, eval_data, n=8):
    """Check for n-gram overlap between training and eval sets."""

    def get_ngrams(text, n):
        # Lowercased word n-grams; texts shorter than n words yield an empty set
        words = text.lower().split()
        return set(tuple(words[i:i+n]) for i in range(len(words) - n + 1))

    # Collect every n-gram that appears anywhere in the training data
    train_ngrams = set()
    for example in train_data:
        text = example.get("instruction", "") + " " + example.get("output", "")
        train_ngrams.update(get_ngrams(text, n))

    # Flag eval examples that share at least one n-gram with the training data
    contaminated = []
    for i, example in enumerate(eval_data):
        text = example.get("instruction", "") + " " + example.get("output", "")
        eval_ngrams = get_ngrams(text, n)
        overlap = eval_ngrams & train_ngrams
        if overlap:
            contaminated.append(i)

    print(f"Potentially contaminated: {len(contaminated)}/{len(eval_data)} eval examples")
    return contaminated
Building Custom Evaluation Datasets
The best evaluation set is one you build specifically for your use case:
- Define 5-10 capability categories your model should excel at
- Write 10-20 test prompts per category covering easy, medium, and hard difficulty
- Create reference outputs (ideal responses) for automated comparison
- Include adversarial examples that test edge cases and failure modes
- Keep the eval set completely separate from training data — never use it for any purpose other than evaluation
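A quick coverage check can catch gaps in such a set before you rely on it. The sketch below assumes each eval example is a dict with hypothetical `category` and `difficulty` fields; the field names, the function name `coverage_report`, and the default threshold are illustrative choices rather than a fixed schema.

```python
from collections import defaultdict

def coverage_report(eval_set, min_per_category=10, difficulties=("easy", "medium", "hard")):
    """Return a list of human-readable coverage problems (empty if none)."""
    by_category = defaultdict(list)
    for example in eval_set:
        by_category[example["category"]].append(example)

    problems = []
    for category, examples in sorted(by_category.items()):
        if len(examples) < min_per_category:
            problems.append(
                f"{category}: only {len(examples)} prompts, expected at least {min_per_category}"
            )
        missing = set(difficulties) - {ex["difficulty"] for ex in examples}
        if missing:
            problems.append(f"{category}: missing difficulty levels: {', '.join(sorted(missing))}")
    return problems

eval_set = [
    {"category": "extraction", "difficulty": "easy", "prompt": "List the parties named in this clause."},
    {"category": "extraction", "difficulty": "medium", "prompt": "Extract every date and its meaning."},
    {"category": "extraction", "difficulty": "hard", "prompt": "Extract obligations that conflict."},
    {"category": "summarization", "difficulty": "easy", "prompt": "Summarize this clause in one sentence."},
]
for problem in coverage_report(eval_set, min_per_category=2):
    print(problem)
```

Running a check like this whenever the eval set changes keeps thin categories from silently skewing aggregate scores.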
Standard Benchmarks
For broader comparison with other models:
- MT-Bench: 80 multi-turn questions across 8 categories. Uses GPT-4 as judge. Good for conversational models.
- AlpacaEval: 805 instructions evaluated by GPT-4. Measures general instruction-following ability.
- MMLU: Multiple-choice questions across 57 subjects. Tests factual knowledge.
- HumanEval / MBPP: Code generation benchmarks. Essential if your model generates code.
Practical Evaluation Workflow
def full_evaluation(model, tokenizer, test_set, base_model=None):
    """Complete evaluation pipeline.

    Assumes generate(model, tokenizer, prompt) is your own inference helper
    that returns the decoded completion as a string.
    """
    results = {"automated": {}, "llm_judge": {}, "comparison": {}}

    # 1. Generate outputs
    predictions = []
    for example in test_set:
        output = generate(model, tokenizer, example["prompt"])
        predictions.append(output)

    # 2. Automated metrics (if reference outputs exist)
    if "reference" in test_set[0]:
        references = [ex["reference"] for ex in test_set]
        results["automated"]["exact_match"] = exact_match(predictions, references)

    # 3. LLM-as-judge scoring
    scores = []
    for example, pred in zip(test_set, predictions):
        score = llm_judge_score(example["prompt"], pred, "accuracy, helpfulness, format")
        scores.append(score["score"])
    results["llm_judge"]["mean_score"] = sum(scores) / len(scores)

    # 4. Comparison with base model (if available)
    if base_model is not None:
        base_predictions = [generate(base_model, tokenizer, ex["prompt"]) for ex in test_set]
        wins = 0
        for example, ft_pred, base_pred in zip(test_set, predictions, base_predictions):
            # The fine-tuned output is always shown as Response A here, so this
            # win rate is exposed to position bias; ties count as non-wins
            result = llm_judge_compare(example["prompt"], ft_pred, base_pred, "overall quality")
            if result["winner"] == "A":
                wins += 1
        results["comparison"]["win_rate"] = wins / len(test_set)

    return results
In the next lesson, we will cover how to merge your LoRA adapter, export it to production formats, and prepare it for deployment.