Evaluation Metrics
Why Evaluation Matters
Building a RAG pipeline is one thing. Knowing whether it actually works well is another. Without rigorous evaluation, you are flying blind -- making changes to your chunking strategy, embedding model, or retrieval parameters with no way to measure whether things improved or got worse.
RAG evaluation is harder than traditional ML evaluation because the pipeline has multiple stages that can fail independently. The retriever might find the right documents, yet the LLM might still hallucinate. Or the LLM might write a fluent, plausible answer on top of irrelevant context. You need metrics that evaluate each stage separately and the pipeline as a whole.
The Four Key Metrics
Faithfulness
Does the generated answer accurately reflect what the retrieved context says? A faithful answer does not add information that is not in the context and does not contradict the context.
- High faithfulness: The answer only states things that are in the retrieved documents.
- Low faithfulness: The answer includes claims that cannot be found in the context (hallucinations).
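As a rough illustration, faithfulness can be read as the share of claims in the answer that the context supports. The sketch below assumes the claims have already been extracted and labeled as supported or unsupported, by hand or by a judge model. The function name and the labels are hypothetical, and this is not meant to mirror how any particular library computes the metric internally.

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Fraction of answer claims that the retrieved context supports."""
    if not claim_supported:
        # Assumption: an answer with no checkable claims scores 0.
        return 0.0
    return sum(claim_supported) / len(claim_supported)


# Hypothetical labels for the three claims found in one answer:
# two are backed by the context, one is not.
labels = [True, True, False]
print(f"faithfulness = {faithfulness_score(labels):.2f}")
```

A single unsupported claim in a short answer moves the score a lot, which is why faithfulness is usually averaged over many questions.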
Answer Relevance
Does the generated answer actually address the user's question? An answer can be faithful to the context but completely miss the point of the question.
- High relevance: The answer directly addresses what was asked.
- Low relevance: The answer discusses tangentially related topics or misinterprets the question.
Context Precision
Of the retrieved chunks, how many are actually relevant to the question? A high context precision means the retriever is not wasting context window space on irrelevant documents.
- High precision: Most retrieved chunks are useful for answering the question.
- Low precision: The retriever returns many chunks that are not relevant.
Context Recall
Of all the relevant information in the knowledge base, how much did the retriever actually find? High recall means the retriever is not missing important documents.
- High recall: The retriever finds all (or most) of the relevant information.
- Low recall: The retriever misses key documents that contain the answer.
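The two retrieval metrics can be sketched with plain set arithmetic, assuming each chunk has an ID and a human has marked which chunks are relevant to the question. The chunk IDs and function names below are hypothetical. Automated frameworks typically estimate relevance with an LLM and may weight results by rank, so treat this as the simplest possible version of the idea.

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Share of retrieved chunks that are relevant."""
    if not retrieved:
        return 0.0
    hits = sum(1 for chunk_id in retrieved if chunk_id in relevant)
    return hits / len(retrieved)


def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Share of relevant chunks that were retrieved."""
    if not relevant:
        # Assumption: with nothing to find, recall is reported as 0.
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


# Hypothetical example: 4 chunks retrieved, 3 chunks are actually relevant.
retrieved_ids = ["c1", "c2", "c3", "c4"]
relevant_ids = {"c1", "c3", "c7"}
print(context_precision(retrieved_ids, relevant_ids))
print(context_recall(retrieved_ids, relevant_ids))
```

Here two of the four retrieved chunks are relevant, and two of the three relevant chunks were found. Retrieving more chunks tends to raise recall and lower precision, which is the trade-off these two metrics expose.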
The RAGAS Framework
RAGAS (Retrieval Augmented Generation Assessment) is one of the most widely used frameworks for evaluating RAG systems. It provides automated metrics for all four dimensions.
Setting Up RAGAS
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Prepare your evaluation dataset
eval_data = {
    "question": [
        "What is the vacation policy for new employees?",
        "How do I submit an expense report?",
        "What is the dress code?",
    ],
    "answer": [
        # Generated answers from your RAG pipeline
        "New employees receive 15 days of PTO in their first year.",
        "Submit expense reports through the HR portal within 30 days.",
        "The dress code is business casual Monday through Thursday.",
    ],
    "contexts": [
        # Retrieved contexts (list of strings per question)
        ["New hires are entitled to 15 days of paid time off..."],
        ["Expense reports must be submitted via the HR portal..."],
        ["Our dress code policy requires business casual attire..."],
    ],
    "ground_truth": [
        # Reference answers (for recall calculation)
        "New employees get 15 days of PTO, accruing at 1.25 days per month.",
        "Use the HR portal to submit expense reports within 30 days of the expense.",
        "Business casual Monday-Thursday, casual Friday.",
    ],
}
dataset = Dataset.from_dict(eval_data)

# Run evaluation
results = evaluate(
    dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
    ],
)
print(results)
# {'faithfulness': 0.92, 'answer_relevancy': 0.88,
#  'context_precision': 0.85, 'context_recall': 0.78}
Interpreting RAGAS Scores
- Faithfulness > 0.9: Your LLM is staying grounded in the context. Good.
- Answer Relevance > 0.85: Answers are addressing the questions well.
- Context Precision > 0.8: Your retriever is finding relevant documents.
- Context Recall > 0.75: Your retriever is not missing too much.
Treat these thresholds as rules of thumb rather than fixed targets. Scores below them point to specific areas to improve. Low faithfulness means you need better prompting or a more reliable LLM. Low context precision usually means your retrieval is too broad. Low recall usually means your retrieval is too narrow.
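A small helper can turn the thresholds listed above into an automatic check. This is a minimal sketch: the function name is hypothetical, the metric keys are assumed to match the names used in the RAGAS example, and the scores passed in are made up for illustration.

```python
# Thresholds from the list above, keyed by the metric names used earlier.
THRESHOLDS = {
    "faithfulness": 0.9,
    "answer_relevancy": 0.85,
    "context_precision": 0.8,
    "context_recall": 0.75,
}


def find_weak_metrics(
    scores: dict[str, float], thresholds: dict[str, float] = THRESHOLDS
) -> dict[str, float]:
    """Return the metrics whose score is not strictly above its threshold."""
    return {
        name: scores[name]
        for name, minimum in thresholds.items()
        if name in scores and scores[name] <= minimum
    }


# Hypothetical scores: only context recall falls short.
example_scores = {
    "faithfulness": 0.92,
    "answer_relevancy": 0.88,
    "context_precision": 0.85,
    "context_recall": 0.70,
}
print(find_weak_metrics(example_scores))
```

Metrics missing from the scores dictionary are skipped rather than flagged, so the same helper works when you only run a subset of metrics.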
Building an Evaluation Dataset
The quality of your evaluation depends entirely on the quality of your test dataset. Here is how to build one:
Manual Curation (Gold Standard)
Have domain experts create question-answer pairs from your actual documents. This is time-consuming but produces the most reliable evaluation data.
# Structure for a manually curated eval dataset
eval_questions = [
    {
        "question": "What is the maximum parental leave duration?",
        "ground_truth": "Employees can take up to 16 weeks of parental leave.",
        "source_document": "hr_policies/parental_leave.md",
        "difficulty": "easy",  # easy, medium, hard
    },
    {
        "question": "Can employees combine parental leave with PTO?",
        "ground_truth": "Yes, employees may use accrued PTO to extend parental leave.",
        "source_document": "hr_policies/parental_leave.md",
        "difficulty": "medium",
    },
]
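Curated records like these are row-oriented, while the RAGAS example earlier expects one list per column. The sketch below shows one way to bridge the two, assuming you can call your pipeline and get back an answer plus the retrieved context strings. Both `to_ragas_columns` and `fake_pipeline` are hypothetical names, and `fake_pipeline` is only a stand-in so the example runs without a real RAG chain.

```python
from typing import Callable


def to_ragas_columns(
    records: list[dict],
    run_pipeline: Callable[[str], tuple[str, list[str]]],
) -> dict[str, list]:
    """Turn curated records into question/answer/contexts/ground_truth columns."""
    columns = {"question": [], "answer": [], "contexts": [], "ground_truth": []}
    for record in records:
        # run_pipeline is assumed to return (answer, list of context strings).
        answer, contexts = run_pipeline(record["question"])
        columns["question"].append(record["question"])
        columns["answer"].append(answer)
        columns["contexts"].append(contexts)
        columns["ground_truth"].append(record["ground_truth"])
    return columns


def fake_pipeline(question: str) -> tuple[str, list[str]]:
    """Stand-in for a real RAG chain, for illustration only."""
    return "Up to 16 weeks.", ["Employees can take up to 16 weeks of parental leave."]


columns = to_ragas_columns(
    [
        {
            "question": "What is the maximum parental leave duration?",
            "ground_truth": "Employees can take up to 16 weeks of parental leave.",
        }
    ],
    fake_pipeline,
)
print(columns["question"], columns["answer"])
```

The resulting dictionary has the same four keys as the evaluation dataset shown earlier, so it can be passed to `Dataset.from_dict` in the same way.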
Synthetic Generation (Scalable)
Use an LLM to generate questions from your documents. This scales better but requires validation.
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0.7)


def generate_eval_questions(document_text: str, n_questions: int = 5) -> list[dict]:
    """Generate evaluation questions from a document."""
    prompt = ChatPromptTemplate.from_messages([
        ("system", """Generate {n} question-answer pairs from this document.
Each pair should:
- Test a specific fact from the document
- Have a clear, verifiable answer
- Vary in difficulty
Return as JSON array with "question" and "answer" fields.
Document:
{document}"""),
        ("human", "Generate the question-answer pairs."),
    ])
    # JsonOutputParser converts the model's reply into Python objects, so the
    # function returns a list of dicts as annotated instead of a raw string.
    chain = prompt | llm | JsonOutputParser()
    return chain.invoke({
        "n": n_questions,
        "document": document_text,
    })


# Generate questions from each document in your knowledge base
# Then have a human review and correct them
Tip: Always have a human review synthetically generated eval questions. LLMs can generate questions that are ambiguous, too easy, or have incorrect answers. A 30-minute review pass catches most issues.
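Before the human pass, a cheap automated filter can remove the obviously broken pairs so reviewers spend their time on substance. The sketch below assumes each generated pair is a dictionary with "question" and "answer" fields, as requested in the generation prompt. The function name is hypothetical, and the checks (empty fields, duplicate questions) are only a starting point.

```python
def validate_pairs(pairs: list[dict]) -> tuple[list[dict], list[str]]:
    """Drop malformed or duplicate question-answer pairs before human review."""
    seen_questions = set()
    kept = []
    problems = []
    for index, pair in enumerate(pairs):
        question = str(pair.get("question", "")).strip()
        answer = str(pair.get("answer", "")).strip()
        if not question or not answer:
            problems.append(f"pair {index}: missing question or answer")
            continue
        # Compare case-insensitively so trivial variants count as duplicates.
        key = question.lower()
        if key in seen_questions:
            problems.append(f"pair {index}: duplicate question")
            continue
        seen_questions.add(key)
        kept.append({"question": question, "answer": answer})
    return kept, problems


# Hypothetical generator output with one duplicate and one empty answer.
raw_pairs = [
    {"question": "What is the dress code?", "answer": "Business casual."},
    {"question": "what is the dress code?", "answer": "Business casual."},
    {"question": "How many PTO days do new hires get?", "answer": ""},
]
kept_pairs, issues = validate_pairs(raw_pairs)
print(len(kept_pairs), issues)
```

This filter cannot tell whether an answer is correct or a question is ambiguous, so it complements the human review rather than replacing it.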
LLM-as-Judge Evaluation
When you do not have ground truth answers, you can use an LLM to evaluate the quality of your RAG responses. This is called LLM-as-Judge.
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

judge_llm = ChatOpenAI(model="gpt-4o", temperature=0)

JUDGE_PROMPT = ChatPromptTemplate.from_messages([
    ("system", """You are an impartial evaluator. Rate the quality of an AI
assistant's answer to a user question, given the source context.
Evaluate on three dimensions (1-5 each):
1. ACCURACY: Is the answer factually correct based on the context?
2. COMPLETENESS: Does the answer fully address the question?
3. CONCISENESS: Is the answer appropriately brief without losing info?
Respond with JSON:
{{"accuracy": N, "completeness": N, "conciseness": N, "reasoning": "..."}}"""),
    ("human", """Context: {context}
Question: {question}
Answer: {answer}
Evaluate this answer."""),
])


def evaluate_response(question: str, answer: str, context: str) -> dict:
    """Use LLM-as-judge to evaluate a RAG response."""
    # Parse the judge's JSON reply so callers get a dict, as annotated,
    # rather than the raw response text.
    chain = JUDGE_PROMPT | judge_llm | JsonOutputParser()
    return chain.invoke({
        "context": context,
        "question": question,
        "answer": answer,
    })
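Judge models do not always follow the requested format, so it helps to check the reply before trusting the numbers. The sketch below assumes the judge's reply is available as a JSON string with the three dimensions named in the prompt above. The function name is hypothetical, and rejecting out-of-range values with an exception is one design choice among several.

```python
import json

DIMENSIONS = ("accuracy", "completeness", "conciseness")


def parse_judge_output(raw: str) -> dict:
    """Parse the judge's JSON reply and check each score is an integer from 1 to 5."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    for name in DIMENSIONS:
        value = data.get(name)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{name} must be an integer from 1 to 5, got {value!r}")
    return data


# Hypothetical judge reply.
reply = '{"accuracy": 5, "completeness": 4, "conciseness": 3, "reasoning": "Grounded."}'
print(parse_judge_output(reply)["accuracy"])
```

Failing loudly on a bad reply keeps malformed scores out of your averages. Counting how often the judge fails validation is itself a useful signal about prompt quality.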
A/B Testing RAG Configurations
When you change a component of your RAG system (new embedding model, different chunk size, added re-ranking), you need to compare the old and new configurations on the same test set.
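import json


def overall_score(judge_output) -> float:
    """Average the judge's three 1-5 ratings into a single number."""
    # Accept either a parsed dict or the raw JSON string from the judge.
    scores = json.loads(judge_output) if isinstance(judge_output, str) else judge_output
    return (scores["accuracy"] + scores["completeness"] + scores["conciseness"]) / 3


def ab_test_configs(eval_dataset: list, config_a: dict, config_b: dict) -> dict:
    """Compare two RAG configurations on the same eval set."""
    if not eval_dataset:
        raise ValueError("eval_dataset must contain at least one item")
    results_a = []
    results_b = []
    for item in eval_dataset:
        question = item["question"]
        # The reference answer stands in for the context the judge checks against.
        reference = item.get("ground_truth", "")
        # Run config A
        answer_a = config_a["chain"].invoke(question)
        results_a.append(overall_score(evaluate_response(question, answer_a, reference)))
        # Run config B
        answer_b = config_b["chain"].invoke(question)
        results_b.append(overall_score(evaluate_response(question, answer_b, reference)))
    return {
        "config_a_avg": sum(results_a) / len(results_a),
        "config_b_avg": sum(results_b) / len(results_b),
        "config_a_details": results_a,
        "config_b_details": results_b,
    }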
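Comparing two averages does not tell you whether the gap is larger than the noise in a small test set. Because both configurations answer the same questions, the per-question differences can be resampled to get a rough confidence interval. The sketch below is a paired bootstrap under the assumption that you have one numeric score per question for each configuration. The function name, parameters, and example scores are all hypothetical.

```python
import random


def paired_bootstrap_ci(
    scores_a: list[float],
    scores_b: list[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Bootstrap a confidence interval for mean(B) - mean(A) on paired scores."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    # Work on per-question differences so each question is its own control.
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        sample = rng.choices(diffs, k=len(diffs))  # resample with replacement
        means.append(sum(sample) / len(sample))
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-question scores for two configurations.
config_a_scores = [3.0, 4.0, 2.0, 5.0, 3.0, 4.0]
config_b_scores = [4.0, 4.0, 3.0, 5.0, 4.0, 3.0]
low, high = paired_bootstrap_ci(config_a_scores, config_b_scores)
print(f"95% interval for B - A: [{low:.2f}, {high:.2f}]")
```

If the interval includes zero, the data does not clearly favor either configuration, and a larger eval set is probably needed before switching.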
Human Evaluation Protocols
Automated metrics are valuable, but they do not give a complete picture. Human evaluation catches nuances that LLM judges miss: answer tone, formatting quality, helpfulness, and whether the answer would actually be useful to the end user.
Simple Thumbs Up/Down Protocol
For each RAG response, have evaluators answer:
- Is the answer correct? (Yes/No)
- Is the answer complete? (Yes/No)
- Would you trust this answer? (Yes/No)
- Is any information missing? (Free text)
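Collected ratings are easy to summarize as the share of "yes" votes per question. The sketch below assumes each rating is stored as a dictionary with boolean fields for the three yes/no questions. The field names (`correct`, `complete`, `trusted`) and the function name are hypothetical.

```python
YES_NO_FIELDS = ("correct", "complete", "trusted")


def summarize_ratings(ratings: list[dict]) -> dict[str, float]:
    """Share of 'yes' votes for each yes/no question across all ratings."""
    if not ratings:
        return {field: 0.0 for field in YES_NO_FIELDS}
    return {
        # A missing field is counted as a 'no'.
        field: sum(1 for rating in ratings if rating.get(field)) / len(ratings)
        for field in YES_NO_FIELDS
    }


# Hypothetical ratings from evaluators for four responses.
ratings = [
    {"correct": True, "complete": True, "trusted": True, "missing": ""},
    {"correct": True, "complete": False, "trusted": True, "missing": "No deadline given"},
    {"correct": False, "complete": False, "trusted": False, "missing": "Wrong policy"},
    {"correct": True, "complete": True, "trusted": False, "missing": ""},
]
print(summarize_ratings(ratings))
```

The free-text field is left out of the summary on purpose. It is better read directly, since it often points to specific documents the retriever missed.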
Comparative Protocol
Show evaluators the same question answered by two different RAG configurations. Ask them which answer is better and why. This is more reliable than absolute scoring because humans are better at comparisons than absolute judgments.
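One practical detail in comparative evaluation is hiding which configuration produced which answer, since raters may otherwise favor whatever appears first or whatever they know is "new". The sketch below shuffles the display order, maps each choice back to a configuration, and computes win rates. All function names and the "A"/"B"/"tie" labels are hypothetical.

```python
import random
from collections import Counter


def make_blind_pair(answer_a: str, answer_b: str, rng: random.Random) -> dict:
    """Shuffle display order so raters cannot tell the configurations apart."""
    if rng.random() < 0.5:
        return {"first": answer_a, "second": answer_b, "first_config": "A"}
    return {"first": answer_b, "second": answer_a, "first_config": "B"}


def resolve_choice(pair: dict, choice: str) -> str:
    """Map a rater's choice ('first', 'second', or 'tie') back to 'A', 'B', or 'tie'."""
    if choice == "tie":
        return "tie"
    first = pair["first_config"]
    other = "B" if first == "A" else "A"
    return first if choice == "first" else other


def win_rates(votes: list[str]) -> dict[str, float]:
    """Share of votes for each outcome."""
    counts = Counter(votes)
    total = len(votes)
    return {label: (counts[label] / total if total else 0.0) for label in ("A", "B", "tie")}


rng = random.Random(42)  # fixed seed so the shuffle is reproducible
pair = make_blind_pair("Answer from config A", "Answer from config B", rng)
vote = resolve_choice(pair, "first")
print(win_rates([vote, "B", "B", "tie"]))
```

Keeping the `first_config` field out of the rater's view is the whole point. Only the evaluation script should see it.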
Continuous Evaluation in Production
Evaluation is not a one-time activity. Set up automated evaluation that runs on a schedule:
import json
from datetime import datetime


def run_eval_suite(chain, eval_dataset: list, output_file: str):
    """Run evaluation suite and save results."""
    results = []
    for item in eval_dataset:
        answer = chain.invoke(item["question"])
        score = evaluate_response(
            item["question"], answer, item.get("ground_truth", "")
        )
        # The judge may hand back raw JSON text; parse it so the ratings are usable.
        if isinstance(score, str):
            score = json.loads(score)
        results.append({
            "question": item["question"],
            "answer": answer,
            "score": score,
            "timestamp": datetime.now().isoformat(),
        })
    # Save results
    with open(output_file, "w") as f:
        json.dump(results, f, indent=2)
    # Average the judge's 1-5 ratings across all dimensions and questions
    ratings = [
        r["score"][dimension]
        for r in results
        for dimension in ("accuracy", "completeness", "conciseness")
    ]
    avg_score = sum(ratings) / len(ratings) if ratings else 0.0
    print(f"Average score: {avg_score:.2f}")
    return results
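Scheduled runs are most useful when each run is compared against a known-good baseline. The sketch below flags any metric that dropped by more than a tolerance, assuming both runs are summarized as dictionaries of metric name to average score. The function name, the default tolerance, and the example numbers are hypothetical.

```python
def detect_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, float]:
    """Return metrics that dropped by more than `tolerance` relative to the baseline."""
    drops = {}
    for name, old_value in baseline.items():
        new_value = current.get(name)
        # Metrics missing from the current run are skipped, not flagged.
        if new_value is not None and old_value - new_value > tolerance:
            drops[name] = round(old_value - new_value, 4)
    return drops


# Hypothetical summaries from two runs of the eval suite.
baseline_run = {"faithfulness": 0.92, "context_recall": 0.78}
current_run = {"faithfulness": 0.85, "context_recall": 0.78}
print(detect_regressions(baseline_run, current_run))
```

The tolerance exists because LLM-based scores vary a little from run to run. Setting it too low produces alerts on noise, and setting it too high hides real regressions.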
Tips for Effective Evaluation
- Start with 50-100 hand-curated questions. This is usually enough to catch major issues, although small score differences between configurations can still fall within noise at this sample size.
- Include edge cases. Questions with no answer in the knowledge base, ambiguous questions, questions that span multiple documents.
- Track metrics over time. A dashboard showing faithfulness, relevance, precision, and recall trends helps you catch regressions before users notice.
- Separate retrieval evaluation from generation evaluation. If faithfulness drops, is it because the retriever found the wrong documents or because the LLM hallucinated? Test each stage independently.
- Evaluate with real user queries. Once your system is live, log actual user queries and include them in your eval set. Real queries are often messier and more varied than synthetic ones.
In the next lesson, you will learn how to harden your RAG system for production with caching, monitoring, security, and cost optimization.