RAG Evaluation — How to Know If It Works Well

Episode 8 · 20 minutes

A Quick Recap

So far we’ve learned how to chunk data, embed it, store it in a Vector Database, search it, and generate answers with proper prompts. But one big question remains: how do we know this system actually works well?

Without evaluation, you’re shooting in the dark. You might think your RAG is excellent, but users could be getting irrelevant answers. Or you might make a small change to chunking and break the entire system without realizing it.

RAG Triad — The Three Pillars of Evaluation

RAG evaluation starts with three core metrics called the RAG Triad. Each measures a different aspect of the system.

1. Context Relevance — Did You Find the Right Chunks?

First, you need to check if the retrieval stage works well. When a user asks a question, are the chunks returned from the Vector Database actually relevant?

Suppose a user asks “How do I change my password?” and the system returns a chunk about “Company history.” Clearly, Context Relevance is low.

How to measure it?

# Simple idea: ask an LLM to evaluate
evaluation_prompt = """
User question: {question}
Retrieved chunk: {retrieved_chunk}

Is this chunk relevant to answering the question?
Score: 0 (irrelevant) to 1 (fully relevant)
"""

You can do this for each retrieved chunk and take the average.
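
A minimal sketch of this LLM-as-judge loop, assuming the OpenAI Python SDK and an illustrative judge model (both are assumptions, not requirements of the method):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def score_chunk_relevance(question: str, chunk: str) -> float:
    """Ask the judge LLM for a 0-1 relevance score for one retrieved chunk."""
    prompt = (
        f"User question: {question}\n"
        f"Retrieved chunk: {chunk}\n\n"
        "Is this chunk relevant to answering the question?\n"
        "Reply with only a number from 0 (irrelevant) to 1 (fully relevant)."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative judge model, swap in your own
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return float(response.choices[0].message.content.strip())

def context_relevance(question: str, chunks: list[str]) -> float:
    """Average the per-chunk scores to get Context Relevance for one query."""
    scores = [score_chunk_relevance(question, c) for c in chunks]
    return sum(scores) / len(scores) if scores else 0.0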

2. Groundedness — Is the Answer Based on Context?

Even if you found the right chunks, the LLM might give an answer that isn’t in those chunks. In other words, it hallucinates.

For example, the chunks say “Product X has a two-year warranty” but the LLM answers “Product X has a three-year warranty.” The context is correct, but the answer isn’t grounded.

How to measure it?

evaluation_prompt = """
Context: {retrieved_chunks}
Generated answer: {generated_answer}

Check each sentence of the answer:
- Is this sentence directly inferable from the Context?
- Score: 0 (fabricated) to 1 (fully grounded in Context)
"""

3. Answer Relevance — Does the Answer Address the Question?

This metric measures whether the final answer actually addresses the user’s question. The context might be correct, there might be no hallucination, but the answer might not be relevant to the question.

For example, the user asks “How do I change my password?” and the system answers “Passwords must be at least 8 characters.” Correct, but doesn’t answer the question.

How to measure it?

evaluation_prompt = """
Question: {question}
Answer: {generated_answer}

Does this answer directly address the user's question?
Score: 0 (irrelevant) to 1 (fully responsive)
"""

Relationship Between the Three Metrics

These three metrics complement each other. If all three are high, your system works well; when they diverge, the combination tells you which stage to look at (a small diagnostic sketch follows the list):

  • High Context Relevance + High Groundedness + Low Answer Relevance = Search is good, LLM answers poorly
  • Low Context Relevance + anything else = Problem is in the retrieval stage
  • High Context Relevance + Low Groundedness = LLM is hallucinating
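
These patterns can be turned into a simple diagnostic. The 0.7 threshold below is an arbitrary assumption you would tune for your own system:

def diagnose(context_relevance: float, groundedness: float,
             answer_relevance: float, threshold: float = 0.7) -> str:
    """Map the three RAG Triad scores to the stage most likely at fault."""
    if context_relevance < threshold:
        return "Retrieval problem: revisit chunking, embeddings, or the query"
    if groundedness < threshold:
        return "Hallucination: the LLM is not sticking to the retrieved context"
    if answer_relevance < threshold:
        return "Generation problem: retrieval is fine, but the answer misses the question"
    return "All three are high: the pipeline looks healthy"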

RAGAS — RAG Evaluation Framework

RAGAS (Retrieval Augmented Generation Assessment) is an open-source framework that automates RAG evaluation.

Simple Installation and Usage

pip install ragas

from ragas import evaluate
from ragas.metrics import (
    context_relevancy,
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
)
from datasets import Dataset  # RAGAS expects a Hugging Face Dataset, not a plain dict

# Evaluation data (column names follow the older 0.x RAGAS API;
# newer versions rename some of them, so check your installed version)
eval_data = {
    "question": ["Question 1", "Question 2", ...],
    "answer": ["Answer 1", "Answer 2", ...],
    "contexts": [["Chunk 1-1", "Chunk 1-2"], ...],
    "ground_truth": ["Correct answer 1", "Correct answer 2", ...]
}
dataset = Dataset.from_dict(eval_data)

# Evaluate
results = evaluate(
    dataset=dataset,
    metrics=[
        context_relevancy,
        faithfulness,
        answer_relevancy
    ]
)

print(results)
# {'context_relevancy': 0.82, 'faithfulness': 0.91, 'answer_relevancy': 0.87}

RAGAS Metrics

  • Context Relevancy: Equivalent to Context Relevance in the RAG Triad. Are the retrieved chunks relevant to the question?
  • Faithfulness: Equivalent to Groundedness. Is the answer faithful to the Context?
  • Answer Relevancy: Is the answer relevant to the question?
  • Context Precision: Are relevant chunks ranked higher?
  • Context Recall: Has all necessary information been retrieved?

Note: RAGAS uses an LLM for evaluation, meaning it has costs. Each evaluation requires multiple API calls.

Manual vs Automated Evaluation

Manual Evaluation

The simplest approach: create a set of question-answer pairs and have yourself (or your team) review the system’s answers.

# Simple test set
test_cases = [
    {
        "question": "How do I change my password?",
        "expected_answer": "From the Settings menu, Security section, Change Password option",
        "expected_sources": ["doc_security_v2.pdf"]
    },
    # ...
]

Pros: Accurate, catches subtle issues

Cons: Time-consuming, not scalable, human judgment varies

Automated Evaluation with LLM

Ask another LLM (or the same one) to evaluate the answers. This is what RAGAS does.

Pros: Fast, scalable, reproducible

Cons: The LLM itself might evaluate incorrectly, API costs

Combining Both — The Best Approach

In practice, the best approach is:

  1. Create a Golden Test Set of 50-100 questions and evaluate manually
  2. For daily evaluations and after each change, use automated evaluation (see the sketch after this list)
  3. Update the Golden Test Set monthly
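
Concretely, the same Golden Test Set can feed both the manual review and the automated runs. A minimal sketch, where the file name and the judge_score helper are assumptions:

import json

def load_golden_set(path: str = "golden_test_set.jsonl") -> list[dict]:
    """Each line holds one case: question, expected_answer, expected_sources."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def run_automated_eval(rag_system, golden_set: list[dict]) -> float:
    """Re-run every golden question after a change and report the average score."""
    scores = []
    for case in golden_set:
        answer = rag_system.query(case["question"])
        # judge_score is a hypothetical helper, e.g. an LLM-as-judge scorer
        # like the ones sketched in the RAG Triad section
        scores.append(judge_score(case["question"], answer))
    return sum(scores) / len(scores)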

Common Failures — Where RAG Breaks Down

1. Missing Context

The necessary information simply doesn’t exist in the Knowledge Base. For example, a user asks about a new feature whose documentation hasn’t been added yet.

Solution: Monitor which questions get “I don’t know” answers. These indicate gaps in the Knowledge Base.
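
A tiny sketch of that monitoring idea; the marker phrases are assumptions and should match whatever refusal wording your prompt actually uses:

UNKNOWN_MARKERS = ["i don't know", "not in the provided context", "no information"]

def flag_knowledge_gap(question: str, answer: str, gap_log: list[str]) -> None:
    """Record questions the system could not answer so missing docs can be added."""
    if any(marker in answer.lower() for marker in UNKNOWN_MARKERS):
        gap_log.append(question)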

2. Wrong Context

Irrelevant chunks are retrieved. The problem might be with Embedding, Chunking, or ambiguous questions.

Solution: Use hybrid search. Add metadata filtering. Rewrite the query before searching.

3. Hallucination

The LLM says things not in the Context. Usually happens when the Prompt isn’t restrictive enough.

Solution: Tighten the prompt. Lower the LLM temperature (temperature = 0). Use better models.
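
For example, a more restrictive generation call might look like this, reusing the client from the earlier sketches (model name and prompt wording are illustrative):

GROUNDED_SYSTEM_PROMPT = (
    "Answer ONLY using the provided context. "
    "If the context does not contain the answer, say you don't know."
)

def generate_grounded_answer(question: str, context: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # illustrative model
        temperature=0,         # deterministic output, less inclined to embellish
        messages=[
            {"role": "system", "content": GROUNDED_SYSTEM_PROMPT},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content.strip()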

4. Incomplete Answer

The answer is correct but incomplete. For example, there are 3 steps but only 2 are mentioned.

Solution: Increase top_k. Review chunk size — chunks might be too small.

5. Outdated Information

The Knowledge Base hasn’t been updated and answers are old.

Solution: Build an automatic update pipeline. Add date metadata and prefer newer chunks.
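
One way to implement the recency preference, assuming each chunk carries an ISO-formatted updated_at date in its metadata (the field name and score scale are assumptions):

from datetime import datetime

def prefer_recent(chunks: list[dict], top_k: int = 5) -> list[dict]:
    """Re-rank retrieved chunks so that, at similar relevance, newer ones win."""
    def sort_key(chunk: dict):
        updated = datetime.fromisoformat(chunk["metadata"]["updated_at"])
        # Coarse similarity bucket first, then recency as the tie-breaker
        return (round(chunk["score"], 1), updated)
    return sorted(chunks, key=sort_key, reverse=True)[:top_k]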

Building a Test Set — Step by Step

Here is a practical, step-by-step way to build a good test set:

Step 1 — Collect Real Questions

Gather real user questions from system logs. If the system hasn’t launched yet, ask the support team to list the most common questions.

Step 2 — Categorize Questions

categories = {
    "factual": ["What's the price of X?", "How long is Y's warranty?"],
    "how_to": ["How do I install?", "How do I configure?"],
    "troubleshooting": ["Why doesn't it work?", "What's error Z?"],
    "comparison": ["What's the difference between X and Y?"],
    "out_of_scope": ["What's the weather tomorrow?"]
}

Step 3 — Write Reference Answers

For each question, write the correct answer. These are the “Ground Truth.”

Step 4 — Run and Compare

for test in test_cases:
    rag_answer = rag_system.query(test["question"])

    # evaluate_context / evaluate_groundedness / evaluate_answer are your own
    # scoring helpers, e.g. the LLM-as-judge functions from the RAG Triad section
    scores = {
        "context_relevance": evaluate_context(test, rag_answer),
        "groundedness": evaluate_groundedness(rag_answer),
        "answer_relevance": evaluate_answer(test, rag_answer)
    }

    log_results(test, rag_answer, scores)  # persist results for later comparison

A/B Testing — Comparing Two Versions

When you want to make a change (like adjusting chunk size or switching embedding models), A/B Testing is the most reliable way to measure its impact.

# Simple A/B Testing concept
results = []

for question in test_questions:
    answer_a = rag_v1.query(question)  # Current version
    answer_b = rag_v2.query(question)  # New version

    # evaluate() is your scoring function, e.g. an average of the RAG Triad metrics
    score_a = evaluate(question, answer_a)
    score_b = evaluate(question, answer_b)

    results.append({
        "question": question,
        "v1_score": score_a,
        "v2_score": score_b,
        "winner": "v2" if score_b > score_a else "v1"
    })

# Final result
v2_win_rate = sum(1 for r in results if r["winner"] == "v2") / len(results)
print(f"V2 was better in {v2_win_rate:.0%} of cases")

Important note: Always use a fixed set of questions. If questions differ each time, you can’t make a fair comparison.

Summary

In this episode you learned:

  • RAG Triad: the three core evaluation metrics — Context Relevance, Groundedness, Answer Relevance
  • RAGAS is a ready-made framework for automated evaluation
  • Combining manual and automated evaluation is the best approach
  • Common RAG failures and how to diagnose them
  • How to build a good test set
  • How A/B Testing helps you measure changes

Now you know how well your system works. But can it be improved? Absolutely! In the next episode, we’ll discuss Advanced RAG Techniques — Reranking, Query Expansion, HyDE, and other fascinating methods that dramatically improve RAG quality.