A Quick Recap
So far we’ve learned how to chunk data, embed it, store it in a Vector Database, search it, and generate answers with proper prompts. But one big question remains: how do we know this system actually works well?
Without evaluation, you’re shooting in the dark. You might think your RAG is excellent, but users could be getting irrelevant answers. Or you might make a small change to chunking and break the entire system without realizing it.
RAG Triad — The Three Pillars of Evaluation
RAG evaluation starts with three core metrics called the RAG Triad. Each measures a different aspect of the system.
1. Context Relevance — Did You Find the Right Chunks?
First, you need to check if the retrieval stage works well. When a user asks a question, are the chunks returned from the Vector Database actually relevant?
Suppose a user asks “How do I change my password?” and the system returns a chunk about “Company history.” Clearly, Context Relevance is low.
How to measure it?
# Simple idea: ask an LLM to evaluate
evaluation_prompt = """
User question: {question}
Retrieved chunk: {retrieved_chunk}
Is this chunk relevant to answering the question?
Score: 0 (irrelevant) to 1 (fully relevant)
"""
You can do this for each retrieved chunk and take the average.
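This averaging step can be sketched in a few lines. `judge_chunk` stands in for the LLM call with the prompt above; it is stubbed here with a crude keyword overlap only so the example runs on its own — in a real system you would replace its body with an actual LLM call.

```python
# Stub judge: in practice this would send the evaluation prompt to an LLM.
# Keyword overlap is used only to keep the example self-contained.
def judge_chunk(question: str, chunk: str) -> float:
    q_words = set(question.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / len(q_words)

# Context Relevance: score every retrieved chunk, then average.
def context_relevance(question: str, chunks: list[str]) -> float:
    if not chunks:
        return 0.0
    return sum(judge_chunk(question, c) for c in chunks) / len(chunks)

chunks = ["go to settings and change your password", "our company history"]
score = context_relevance("how do i change my password", chunks)
```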
2. Groundedness — Is the Answer Based on Context?
Even if you found the right chunks, the LLM might give an answer that isn’t in those chunks. In other words, it hallucinates.
For example, the chunks say “Product X has a two-year warranty” but the LLM answers “Product X has a three-year warranty.” The context is correct, but the answer isn’t grounded.
How to measure it?
evaluation_prompt = """
Context: {retrieved_chunks}
Generated answer: {generated_answer}
Check each sentence of the answer:
- Is this sentence directly inferable from the Context?
- Score: 0 (fabricated) to 1 (fully grounded in Context)
"""
3. Answer Relevance — Does the Answer Address the Question?
This metric measures whether the final answer actually addresses the user’s question. The context might be correct, there might be no hallucination, but the answer might not be relevant to the question.
For example, the user asks “How do I change my password?” and the system answers “Passwords must be at least 8 characters.” Correct, but doesn’t answer the question.
How to measure it?
evaluation_prompt = """
Question: {question}
Answer: {generated_answer}
Does this answer directly address the user's question?
Score: 0 (irrelevant) to 1 (fully responsive)
"""
Relationship Between the Three Metrics
These three metrics complement each other: when all three are high, the system works well, and when one is low, the combination points to the broken stage:
- High Context Relevance + High Groundedness + Low Answer Relevance = Retrieval and grounding are fine, but the answer doesn't address the question
- Low Context Relevance + anything else = Problem is in the retrieval stage
- High Context Relevance + Low Groundedness = LLM is hallucinating
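The decision table above fits into a tiny diagnostic helper. The 0.7 threshold is an arbitrary assumption; tune it against your own system:

```python
# Map triad scores to a diagnosis, checking the retrieval stage first
# (a retrieval failure makes the downstream metrics meaningless).
def diagnose(context_relevance: float, groundedness: float,
             answer_relevance: float, threshold: float = 0.7) -> str:
    if context_relevance < threshold:
        return "retrieval problem: fix chunking, embedding, or search first"
    if groundedness < threshold:
        return "hallucination: tighten the prompt, lower temperature"
    if answer_relevance < threshold:
        return "answer misses the question: review the generation prompt"
    return "all three metrics look healthy"
```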
RAGAS — RAG Evaluation Framework
RAGAS (Retrieval Augmented Generation Assessment) is an open-source framework that automates RAG evaluation.
Simple Installation and Usage
pip install ragas
from datasets import Dataset  # RAGAS expects a Hugging Face Dataset, not a plain dict
from ragas import evaluate
from ragas.metrics import (
    context_relevancy,
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
)

# Evaluation data
eval_data = {
    "question": ["Question 1", "Question 2", ...],
    "answer": ["Answer 1", "Answer 2", ...],
    "contexts": [["Chunk 1-1", "Chunk 1-2"], ...],
    "ground_truth": ["Correct answer 1", "Correct answer 2", ...]
}

# Evaluate
results = evaluate(
    dataset=Dataset.from_dict(eval_data),
    metrics=[
        context_relevancy,
        faithfulness,
        answer_relevancy
    ]
)
print(results)
# {'context_relevancy': 0.82, 'faithfulness': 0.91, 'answer_relevancy': 0.87}
RAGAS Metrics
- Faithfulness: Equivalent to Groundedness. Is the answer faithful to the Context?
- Answer Relevancy: Is the answer relevant to the question?
- Context Precision: Are relevant chunks ranked higher?
- Context Recall: Has all necessary information been retrieved?
Note: RAGAS uses an LLM for evaluation, meaning it has costs. Each evaluation requires multiple API calls.
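To make that cost concrete, here is a back-of-the-envelope call count. Every number in it is an assumption; plug in your own test-set size and metric count:

```python
# Rough estimate of LLM API calls for one evaluation run.
def eval_api_calls(n_samples: int, n_metrics: int, calls_per_metric: int = 1) -> int:
    return n_samples * n_metrics * calls_per_metric

# 100 test questions, 3 metrics, ~2 calls per metric (e.g. faithfulness
# first extracts statements, then verifies them): 600 calls per run.
calls = eval_api_calls(100, 3, calls_per_metric=2)
```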
Manual vs Automated Evaluation
Manual Evaluation
The simplest approach: create a set of question-answer pairs and have yourself (or your team) review the system’s answers.
# Simple test set
test_cases = [
    {
        "question": "How do I change my password?",
        "expected_answer": "From the Settings menu, Security section, Change Password option",
        "expected_sources": ["doc_security_v2.pdf"]
    },
    # ...
]
Pros: Accurate, catches subtle issues
Cons: Time-consuming, not scalable, human judgment varies
Automated Evaluation with LLM
Ask another LLM (or the same one) to evaluate the answers. This is what RAGAS does.
Pros: Fast, scalable, reproducible
Cons: The LLM itself might evaluate incorrectly, API costs
Combining Both — The Best Approach
In practice, the best approach is:
- Create a Golden Test Set of 50-100 questions and evaluate manually
- For daily evaluations and after each change, use automated evaluation
- Update the Golden Test Set monthly
Common Failures — Where RAG Breaks Down
1. Missing Context
The necessary information simply doesn’t exist in the Knowledge Base. For example, a user asks about a new feature whose documentation hasn’t been added yet.
Solution: Monitor which questions get “I don’t know” answers. These indicate gaps in the Knowledge Base.
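Such monitoring can be sketched as a simple log filter. The refusal phrases below are assumptions; use whatever wording your own prompt instructs the model to fall back on:

```python
# Phrases your system emits when it cannot answer (adjust to your prompt).
REFUSAL_MARKERS = ("i don't know", "not in the provided context")

# Return the questions whose answers were refusals: Knowledge Base gaps.
def knowledge_gaps(logs: list[dict]) -> list[str]:
    return [
        entry["question"]
        for entry in logs
        if any(m in entry["answer"].lower() for m in REFUSAL_MARKERS)
    ]

logs = [
    {"question": "How do I reset 2FA?", "answer": "I don't know based on the context."},
    {"question": "How do I change my password?", "answer": "From Settings > Security."},
]
gaps = knowledge_gaps(logs)  # → ["How do I reset 2FA?"]
```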
2. Wrong Context
Irrelevant chunks are retrieved. The problem might be with Embedding, Chunking, or ambiguous questions.
Solution: Use hybrid search. Add metadata filtering. Rewrite the query before searching.
3. Hallucination
The LLM says things not in the Context. Usually happens when the Prompt isn’t restrictive enough.
Solution: Tighten the prompt. Lower the LLM temperature (temperature = 0). Use better models.
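A "tightened" prompt might look like the template below. The exact wording is illustrative, not a standard template; the key ingredients are an explicit only-from-context rule and a fixed fallback phrase:

```python
# A restrictive prompt template: answer only from context, refuse otherwise.
STRICT_PROMPT = """Answer ONLY using the context below.
If the context does not contain the answer, reply exactly: "I don't know."
Do not use any outside knowledge.

Context:
{context}

Question: {question}
"""

prompt = STRICT_PROMPT.format(
    context="Product X has a two-year warranty.",
    question="How long is Product X's warranty?",
)
```

A fixed fallback phrase also makes refusals easy to detect in your logs later.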
4. Incomplete Answer
The answer is correct but incomplete. For example, there are 3 steps but only 2 are mentioned.
Solution: Increase top_k. Review chunk size — chunks might be too small.
5. Outdated Information
The Knowledge Base hasn’t been updated and answers are old.
Solution: Build an automatic update pipeline. Add date metadata and prefer newer chunks.
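The "prefer newer chunks" idea can be sketched as a re-scoring step after retrieval. The one-year half-life and the 0.8/0.2 blend are assumptions to tune, and the function names are hypothetical:

```python
from datetime import date

# Recency weight that halves every half_life_days (illustrative numbers).
def recency_weight(doc_date: date, today: date, half_life_days: float = 365.0) -> float:
    age_days = (today - doc_date).days
    return 0.5 ** (age_days / half_life_days)

# Blend vector similarity with recency; 0.8/0.2 is an assumed trade-off.
def rescore(similarity: float, doc_date: date, today: date) -> float:
    return 0.8 * similarity + 0.2 * recency_weight(doc_date, today)

today = date(2025, 1, 1)
fresh = rescore(0.70, date(2025, 1, 1), today)  # recent chunk
stale = rescore(0.75, date(2020, 1, 1), today)  # better raw match, 5 years old
# The fresh chunk now outranks the stale one despite lower raw similarity.
```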
Building a Test Set — Step by Step
Here is a practical, step-by-step way to build a good test set:
Step 1 — Collect Real Questions
Gather real user questions from system logs. If the system hasn’t launched yet, ask the support team to list the most common questions.
Step 2 — Categorize Questions
categories = {
    "factual": ["What's the price of X?", "How long is Y's warranty?"],
    "how_to": ["How do I install?", "How do I configure?"],
    "troubleshooting": ["Why doesn't it work?", "What's error Z?"],
    "comparison": ["What's the difference between X and Y?"],
    "out_of_scope": ["What's the weather tomorrow?"]
}
Step 3 — Write Reference Answers
For each question, write the correct answer. These are the “Ground Truth.”
Step 4 — Run and Compare
for test in test_cases:
    rag_answer = rag_system.query(test["question"])
    scores = {
        "context_relevance": evaluate_context(test, rag_answer),
        "groundedness": evaluate_groundedness(rag_answer),
        "answer_relevance": evaluate_answer(test, rag_answer)
    }
    log_results(test, rag_answer, scores)
A/B Testing — Comparing Two Versions
When you want to make a change (like changing chunk size or switching embedding models), A/B Testing is the best way to measure.
# Simple A/B Testing concept
results = []
for question in test_questions:
    answer_a = rag_v1.query(question)  # Current version
    answer_b = rag_v2.query(question)  # New version
    score_a = evaluate(question, answer_a)
    score_b = evaluate(question, answer_b)
    results.append({
        "question": question,
        "v1_score": score_a,
        "v2_score": score_b,
        "winner": "v2" if score_b > score_a else "v1"  # ties count for v1
    })
# Final result
v2_win_rate = sum(1 for r in results if r["winner"] == "v2") / len(results)
print(f"V2 was better in {v2_win_rate:.0%} of cases")
Important note: Always use a fixed set of questions. If questions differ each time, you can’t make a fair comparison.
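A win rate alone can also be noise on a small test set. A quick two-sided sign test (wins vs. losses, ties dropped) tells you whether the split is lopsided enough to trust; this is a minimal sketch, not a full statistical treatment:

```python
from math import comb

# Two-sided sign test: probability of a win/loss split at least this
# lopsided if the two versions were actually equally good.
def sign_test_p(wins: int, losses: int) -> float:
    n = wins + losses
    k = max(wins, losses)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

p = sign_test_p(wins=30, losses=20)  # ~0.20: a 60% win rate on 50 questions
                                     # is not yet convincing evidence
```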
Summary
In this episode you learned:
- RAG Triad: the three core evaluation metrics — Context Relevance, Groundedness, Answer Relevance
- RAGAS is a ready-made framework for automated evaluation
- Combining manual and automated evaluation is the best approach
- Common RAG failures and how to diagnose them
- How to build a good test set
- How A/B Testing helps you measure changes
Now you know how well your system works. But can it be improved? Absolutely! In the next episode, we’ll discuss Advanced RAG Techniques — Reranking, Query Expansion, HyDE, and other fascinating methods that dramatically improve RAG quality.