Evaluation and Benchmarks — Fine-Tuning LLMs: From Data to Deployment

The Evaluation Gap

You trained a model, the loss went down, and it generates text that looks reasonable. But is it actually better than the base model for your use case? Without rigorous evaluation, you are guessing. Many fine-tuning projects fail not because the training was bad, but because the evaluation was insufficient or missing altogether. This lesson gives you an evaluation toolkit covering automated metrics, human review, LLM judges, and contamination checks.

Task-Specific Automated Metrics

Different tasks demand different metrics. Choose the ones that match your use case.

Exact Match

For tasks with a single correct answer (classification, entity extraction, closed-ended QA):

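def exact_match(predictions, references):
    """Fraction of predictions equal to their reference, ignoring outer whitespace."""
    # zip() silently truncates to the shorter list, so check lengths first
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    correct = sum(1 for p, r in zip(predictions, references) if p.strip() == r.strip())
    return correct / len(predictions)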

# Example
predictions = ["positive", "negative", "neutral"]
references = ["positive", "negative", "positive"]
print(f"Exact Match: {exact_match(predictions, references):.2%}")  # 66.67%
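Strict string equality penalizes harmless differences such as casing or a trailing period. The following is a minimal sketch of a more lenient variant. The `normalize` helper and its rules (lowercasing, dropping ASCII punctuation, collapsing whitespace) are assumptions made for illustration and should be adjusted to your task, since for some tasks casing or punctuation is part of the answer.

```python
import string

def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace (assumed rules)."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def normalized_exact_match(predictions, references):
    """Exact match computed on normalized strings."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        raise ValueError("cannot score an empty prediction list")
    correct = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return correct / len(predictions)

preds = ["Positive.", " negative", "Neutral"]
refs = ["positive", "negative", "positive"]
print(f"Normalized EM: {normalized_exact_match(preds, refs):.2%}")
```

Reporting both the strict and the normalized score can help separate formatting errors from genuinely wrong answers.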

F1 Score

For classification tasks, especially with imbalanced classes:

from sklearn.metrics import classification_report

predictions = ["positive", "negative", "neutral", "positive", "negative"]
references = ["positive", "positive", "neutral", "positive", "negative"]

print(classification_report(references, predictions))
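To make the per-class numbers in such a report concrete, the sketch below recomputes precision, recall, and F1 from raw counts using only the standard library, then averages F1 across classes (macro F1). `per_class_f1` is a hypothetical helper name, not part of scikit-learn.

```python
def per_class_f1(references, predictions):
    """Return {label: (precision, recall, f1)} computed from raw counts."""
    labels = sorted(set(references) | set(predictions))
    results = {}
    for label in labels:
        pairs = list(zip(references, predictions))
        tp = sum(1 for r, p in pairs if r == label and p == label)
        fp = sum(1 for r, p in pairs if r != label and p == label)
        fn = sum(1 for r, p in pairs if r == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results[label] = (precision, recall, f1)
    return results

references = ["positive", "positive", "neutral", "positive", "negative"]
predictions = ["positive", "negative", "neutral", "positive", "negative"]

scores = per_class_f1(references, predictions)
for label, (precision, recall, f1) in scores.items():
    print(f"{label}: P={precision:.2f} R={recall:.2f} F1={f1:.2f}")

# Macro F1 weights every class equally, so a rare class counts as much as a common one
macro_f1 = sum(f1 for _, _, f1 in scores.values()) / len(scores)
print(f"Macro F1: {macro_f1:.3f}")
```

Macro averaging is one reasonable choice for imbalanced data because a model that ignores a rare class is penalized even when overall accuracy stays high.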

BLEU Score

For translation and text generation tasks where you have reference outputs:

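from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

# sentence_bleu takes a list of tokenized references and one tokenized candidate
reference = [["the", "cat", "sat", "on", "the", "mat"]]
candidate = ["the", "cat", "is", "on", "the", "mat"]

# This pair shares no 4-grams, so unsmoothed BLEU collapses to (near) zero.
# Smoothing keeps scores for short sentences informative.
smoothing = SmoothingFunction().method1
score = sentence_bleu(reference, candidate, smoothing_function=smoothing)
print(f"BLEU: {score:.4f}")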

Caution: BLEU measures n-gram overlap with a reference. It is useful for translation but misleading for open-ended generation where multiple valid outputs exist.
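The caution above can be illustrated with a small sketch of clipped n-gram precision, the quantity BLEU is built from. The sentences are invented for this example, and `ngram_precision` is a hypothetical helper rather than an NLTK function. A candidate that copies the reference but reverses its meaning scores higher than a correct paraphrase.

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of candidate tokens against one reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

reference = "the model failed to converge".split()
wrong_near_copy = "the model managed to converge".split()  # opposite meaning
valid_paraphrase = "training did not reach convergence".split()  # same meaning

for name, cand in [("wrong near copy", wrong_near_copy), ("valid paraphrase", valid_paraphrase)]:
    p1 = ngram_precision(cand, reference, 1)
    p2 = ngram_precision(cand, reference, 2)
    print(f"{name}: unigram precision={p1:.2f}, bigram precision={p2:.2f}")
```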

ROUGE Score

For summarization tasks, measuring overlap between generated and reference summaries:

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "The model was fine-tuned on legal documents for contract analysis."
generated = "Fine-tuning was performed on legal documents to analyze contracts."

scores = scorer.score(reference, generated)
for metric, score in scores.items():
    print(f"{metric}: Precision={score.precision:.3f}, Recall={score.recall:.3f}, F1={score.fmeasure:.3f}")
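For intuition about what the rougeL figure measures, here is a rough sketch that computes a ROUGE-L style score from the longest common subsequence (LCS) of tokens. It assumes plain whitespace tokenization and no stemming, so its numbers will generally differ from those of the `rouge_score` package. `lcs_length` and `rouge_l` are illustrative names.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for token_a in a:
        curr = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(reference, generated):
    """Return (precision, recall, f1) based on the token-level LCS."""
    ref_tokens = reference.lower().split()
    gen_tokens = generated.lower().split()
    lcs = lcs_length(ref_tokens, gen_tokens)
    precision = lcs / len(gen_tokens) if gen_tokens else 0.0
    recall = lcs / len(ref_tokens) if ref_tokens else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

p, r, f = rouge_l("the cat sat on the mat", "the cat is on the mat")
print(f"ROUGE-L sketch: Precision={p:.3f}, Recall={r:.3f}, F1={f:.3f}")
```

Unlike ROUGE-1 and ROUGE-2, the LCS rewards words that appear in the same order even when they are not adjacent.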

Human Evaluation

Automated metrics cannot capture everything. For subjective quality (helpfulness, coherence, safety), human evaluation is essential.

Blind Comparison Protocol

The most reliable human evaluation method: show evaluators outputs from the base model and fine-tuned model side-by-side, without revealing which is which.

import random
import json

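def create_blind_evaluation_set(test_prompts, base_outputs, finetuned_outputs):
    """Create a randomized blind evaluation set.

    The "mapping" field is the answer key: remove it from what evaluators
    see and keep it only for unblinding their votes afterwards.
    """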
    evaluation_pairs = []

    for i, prompt in enumerate(test_prompts):
        # Randomly assign A/B positions
        if random.random() > 0.5:
            pair = {
                "id": i,
                "prompt": prompt,
                "response_a": base_outputs[i],
                "response_b": finetuned_outputs[i],
                "mapping": {"a": "base", "b": "finetuned"}
            }
        else:
            pair = {
                "id": i,
                "prompt": prompt,
                "response_a": finetuned_outputs[i],
                "response_b": base_outputs[i],
                "mapping": {"a": "finetuned", "b": "base"}
            }
        evaluation_pairs.append(pair)

    return evaluation_pairs
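Once evaluators have voted, the mapping is needed to translate blind A/B choices back to model names. The sketch below shows one way to do that. The `votes` format (a dict from pair id to "A", "B", or "Tie") and the `tally_preferences` helper are assumptions for illustration.

```python
from collections import Counter

def tally_preferences(evaluation_pairs, votes):
    """Map blind A/B/Tie votes back to model names and count them."""
    counts = Counter()
    for pair in evaluation_pairs:
        vote = votes[pair["id"]].lower()
        if vote == "tie":
            counts["tie"] += 1
        else:
            # pair["mapping"] records which model was shown in each position
            counts[pair["mapping"][vote]] += 1
    return counts

# Only the fields needed for unblinding are shown here
pairs = [
    {"id": 0, "mapping": {"a": "base", "b": "finetuned"}},
    {"id": 1, "mapping": {"a": "finetuned", "b": "base"}},
    {"id": 2, "mapping": {"a": "finetuned", "b": "base"}},
]
votes = {0: "B", 1: "A", 2: "Tie"}

counts = tally_preferences(pairs, votes)
decided = counts["finetuned"] + counts["base"]
print(dict(counts))
print(f"Fine-tuned preferred in {counts['finetuned']}/{decided} decided comparisons")
```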

Evaluation Rubric

Define specific criteria for evaluators:

Accuracy (1-5): Is the information correct?
Relevance (1-5): Does the response address the prompt?
Format compliance (1-5): Does the output follow the required format?
Fluency (1-5): Is the language natural and well-written?
Overall preference: Which response is better? (A / B / Tie)

Tip: Use at least 3 evaluators per example and measure inter-annotator agreement. If evaluators disagree significantly, your rubric needs clarification.
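One common way to quantify agreement among three or more evaluators is Fleiss' kappa. The sketch below is a plain-Python version that assumes every example was rated by the same number of evaluators. The sample ratings are invented for illustration.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of per-example label lists (same rater count each)."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every example needs the same number (>= 2) of ratings")

    category_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of rater pairs that agree on this example
        agreeing = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += agreeing / (n_raters * (n_raters - 1))

    observed = agreement_sum / n_items
    total_ratings = n_items * n_raters
    expected = sum((t / total_ratings) ** 2 for t in category_totals.values())
    if expected == 1.0:
        return 1.0  # only one label was ever used, so agreement is trivially perfect
    return (observed - expected) / (1 - expected)

# Overall preference votes from 3 evaluators on 4 examples
ratings = [
    ["A", "A", "A"],
    ["A", "B", "A"],
    ["B", "B", "B"],
    ["Tie", "A", "B"],
]
kappa = fleiss_kappa(ratings)
print(f"Fleiss' kappa: {kappa:.3f}")
```

A kappa near 1 indicates strong agreement and a value near 0 indicates agreement no better than chance. Any cut-off for "acceptable" agreement is a judgment call that depends on how subjective the criterion is.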

LLM-as-Judge Evaluation

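Using a powerful LLM (GPT-4o, Claude) to evaluate outputs is faster and cheaper than human evaluation, and judge scores often correlate reasonably well with human preferences. The agreement varies by task, so spot-check the judge against a small set of human ratings before relying on it.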

Single-Output Scoring

import json

import openai

client = openai.OpenAI()

def llm_judge_score(prompt, response, criteria):
    """Score a model response on a 1-10 scale using an LLM judge."""
    judge_prompt = f"""You are an expert evaluator. Score the following AI response
on a scale of 1-10 based on these criteria:

{criteria}

User Prompt: {prompt}

AI Response: {response}

Provide your score as a JSON object with keys "score" (integer 1-10)
and "reasoning" (brief explanation).
"""

    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": judge_prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )

    return json.loads(result.choices[0].message.content)

# Example usage
score = llm_judge_score(
    prompt="Explain the concept of LoRA in simple terms",
    response="LoRA is a way to teach an AI new tricks without retraining everything...",
    criteria="Accuracy, clarity, completeness, and appropriate simplification for a general audience."
)
print(f"Score: {score['score']}/10 - {score['reasoning']}")

Pairwise Comparison

Pairwise comparison is often more reliable than absolute scoring. Ask the judge to compare two outputs:

def llm_judge_compare(prompt, response_a, response_b, criteria):
    """Compare two responses and select the better one."""
    judge_prompt = f"""You are an expert evaluator. Compare these two AI responses
and determine which is better based on: {criteria}

User Prompt: {prompt}

Response A: {response_a}

Response B: {response_b}

Respond with JSON: {{"winner": "A" or "B" or "tie", "reasoning": "brief explanation"}}
"""

    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": judge_prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )

    return json.loads(result.choices[0].message.content)

Important: LLM judges have known biases. They tend to prefer longer responses, responses with bullet points, and the first response in a pair (position bias). Mitigate position bias by running each comparison twice with swapped positions and counting a win only when both orderings pick the same response.
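The swapped-position check can be sketched as a thin wrapper around any pairwise judge. The `judge` argument is assumed to be a callable with the same signature and return format as `llm_judge_compare`. The two stub judges below are hypothetical stand-ins so the example runs without API calls.

```python
def compare_both_orders(judge, prompt, response_x, response_y, criteria):
    """Run a pairwise judge in both orders; return "x", "y", or "tie".

    A response wins only if the judge picks it in both orderings.
    Disagreement between the two runs is treated as a tie.
    """
    first = judge(prompt, response_x, response_y, criteria)["winner"]
    second = judge(prompt, response_y, response_x, criteria)["winner"]
    # Translate position labels (A/B) back to the underlying responses
    first_pick = {"A": "x", "B": "y"}.get(first, "tie")
    second_pick = {"A": "y", "B": "x"}.get(second, "tie")
    return first_pick if first_pick == second_pick else "tie"

def prefers_first(prompt, response_a, response_b, criteria):
    """Stub judge with pure position bias: always picks Response A."""
    return {"winner": "A", "reasoning": "stub"}

def prefers_longer(prompt, response_a, response_b, criteria):
    """Stub judge with a consistent (if naive) preference for longer text."""
    if len(response_a) == len(response_b):
        return {"winner": "tie", "reasoning": "stub"}
    winner = "A" if len(response_a) > len(response_b) else "B"
    return {"winner": winner, "reasoning": "stub"}

biased = compare_both_orders(prefers_first, "q", "short", "a longer answer", "quality")
consistent = compare_both_orders(prefers_longer, "q", "short", "a longer answer", "quality")
print(biased)      # the position-biased judge contradicts itself across orderings
print(consistent)  # the consistent judge picks the same response both times
```

This doubles the number of judge calls. Treating disagreements as ties is one conservative choice; logging them separately is another option if you want to measure how position-sensitive the judge is.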

Benchmark Contamination

A critical risk: if your training data contains examples from common benchmarks, your evaluation scores will be inflated.

How it happens:

Synthetic data generated by GPT-4 may contain paraphrased benchmark questions
Web-scraped data may include benchmark datasets
Even indirect contamination (training on blog posts that discuss benchmark questions) can inflate scores

How to prevent it:

Use a custom evaluation set that did not exist before your project
Run n-gram overlap checks between your training data and evaluation set
Report both standard benchmark scores and custom evaluation scores

def check_contamination(train_data, eval_data, n=8):
    """Check for n-gram overlap between training and eval sets."""

    def get_ngrams(text, n):
        words = text.lower().split()
        return set(tuple(words[i:i+n]) for i in range(len(words) - n + 1))

    train_ngrams = set()
    for example in train_data:
        text = example.get("instruction", "") + " " + example.get("output", "")
        train_ngrams.update(get_ngrams(text, n))

    contaminated = []
    for i, example in enumerate(eval_data):
        text = example.get("instruction", "") + " " + example.get("output", "")
        eval_ngrams = get_ngrams(text, n)
        overlap = eval_ngrams & train_ngrams
        if overlap:
            contaminated.append(i)

    print(f"Potentially contaminated: {len(contaminated)}/{len(eval_data)} eval examples")
    return contaminated

Building Custom Evaluation Datasets

The best evaluation set is one you build specifically for your use case:

Define 5-10 capability categories your model should excel at
Write 10-20 test prompts per category covering easy, medium, and hard difficulty
Create reference outputs (ideal responses) for automated comparison
Include adversarial examples that test edge cases and failure modes
Keep the eval set completely separate from training data — never use it for any purpose other than evaluation
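A small validation script can help enforce the checklist above as the eval set grows. The sketch below assumes each eval example is a dict with "category", "difficulty", and "prompt" fields. That schema and the `validate_eval_set` helper are illustrative choices, not a required format.

```python
from collections import defaultdict

REQUIRED_DIFFICULTIES = {"easy", "medium", "hard"}

def validate_eval_set(eval_set, train_prompts, min_per_category=10):
    """Return a list of human-readable problems found in the eval set."""
    problems = []
    by_category = defaultdict(list)
    for example in eval_set:
        by_category[example["category"]].append(example)

    for category, examples in sorted(by_category.items()):
        if len(examples) < min_per_category:
            problems.append(f"{category}: only {len(examples)} prompts")
        missing = REQUIRED_DIFFICULTIES - {ex["difficulty"] for ex in examples}
        if missing:
            problems.append(f"{category}: missing difficulties {sorted(missing)}")

    # Verbatim leakage check only; paraphrased overlap needs an n-gram or similarity check
    train_set = {p.strip().lower() for p in train_prompts}
    leaked = [ex for ex in eval_set if ex["prompt"].strip().lower() in train_set]
    if leaked:
        problems.append(f"{len(leaked)} eval prompts also appear in training data")
    return problems

eval_set = [
    {"category": "summarization", "difficulty": "easy", "prompt": "Summarize clause 1."},
    {"category": "summarization", "difficulty": "hard", "prompt": "Summarize clause 2."},
]
train_prompts = ["summarize clause 2."]

problems = validate_eval_set(eval_set, train_prompts, min_per_category=2)
for problem in problems:
    print("-", problem)
```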

Standard Benchmarks

For broader comparison with other models:

MT-Bench: 80 multi-turn questions across 8 categories. Uses GPT-4 as judge. Good for conversational models.
AlpacaEval: 805 instructions evaluated by GPT-4. Measures general instruction-following ability.
MMLU: Multiple-choice questions across 57 subjects. Tests factual knowledge.
HumanEval / MBPP: Code generation benchmarks. Essential if your model generates code.

Practical Evaluation Workflow

def full_evaluation(model, tokenizer, test_set, base_model=None):
    """Complete evaluation pipeline.

    Assumes a generate(model, tokenizer, prompt) helper that returns the
    model's text output, plus exact_match, llm_judge_score, and
    llm_judge_compare as defined earlier in this lesson.
    """
    if not test_set:
        raise ValueError("test_set must contain at least one example")

    results = {"automated": {}, "llm_judge": {}, "comparison": {}}

    # 1. Generate outputs
    predictions = [generate(model, tokenizer, ex["prompt"]) for ex in test_set]

    # 2. Automated metrics (only if every example has a reference output)
    if all("reference" in ex for ex in test_set):
        references = [ex["reference"] for ex in test_set]
        results["automated"]["exact_match"] = exact_match(predictions, references)

    # 3. LLM-as-judge scoring
    scores = []
    for example, pred in zip(test_set, predictions):
        score = llm_judge_score(example["prompt"], pred, "accuracy, helpfulness, format")
        scores.append(score["score"])
    results["llm_judge"]["mean_score"] = sum(scores) / len(scores)

    # 4. Comparison with base model (if available)
    if base_model is not None:
        base_predictions = [generate(base_model, tokenizer, ex["prompt"]) for ex in test_set]
        wins = 0
        ties = 0
        for example, ft_pred, base_pred in zip(test_set, predictions, base_predictions):
            # The fine-tuned output is always shown as Response A here, so
            # position bias can inflate the win rate. For a fairer estimate,
            # repeat each comparison with the positions swapped.
            result = llm_judge_compare(example["prompt"], ft_pred, base_pred, "overall quality")
            if result["winner"] == "A":
                wins += 1
            elif result["winner"].lower() == "tie":
                ties += 1
        results["comparison"]["win_rate"] = wins / len(test_set)
        results["comparison"]["tie_rate"] = ties / len(test_set)

    return results

In the next lesson, we will cover how to merge your LoRA adapter, export it to production formats, and prepare it for deployment.