Advanced RAG — Reranking, Hybrid Search and Query Expansion

Episode 9 · 22 minutes

A Quick Review

In the previous episode, we learned how to evaluate RAG and identify its common failures. Now it is time to explore techniques that dramatically improve RAG quality. These are the techniques that distinguish a basic RAG from a professional one.

Reranking — The Second Stage of Ranking

During the retrieval stage, you get back a list of “probably relevant” chunks. But the problem is that the initial ranking (based on Cosine Similarity) is not always accurate. The third result might actually be the most relevant, not the first.

Reranking means that after the initial search, a smarter (but slower) model re-scores the results and re-orders them.

Bi-Encoder vs Cross-Encoder

Let me explain the difference between these two models with an example.

Bi-Encoder (the standard Embedding model): It is like reading the question and each chunk separately, summarizing them, then comparing the summaries. It is fast, because chunk vectors can be computed once ahead of time, but it might miss subtle connections.

Cross-Encoder: It is like placing the question and chunk side by side and reading them together. It is slower but much more accurate because it sees direct relationships between words in the question and the chunk.

# Bi-Encoder (Stage 1 — Retrieval)
query_vec = embed(query)        # Embed the query separately
doc_vecs = embed(documents)     # Embed the chunks separately
scores = cosine_similarity(query_vec, doc_vecs)

# Cross-Encoder (Stage 2 — Reranking)
for doc in top_20_results:
    score = cross_encoder(query + " [SEP] " + doc)
    # Processes the query and chunk together
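
If you prefer to run the Cross-Encoder locally rather than through an API, the sentence-transformers library ships pretrained reranking models. A minimal sketch, assuming the stage-1 candidates are plain strings and using the cross-encoder/ms-marco-MiniLM-L-6-v2 checkpoint (an English model; pick a multilingual one if you need it):

from sentence_transformers import CrossEncoder

# Pretrained reranking model; loaded once at startup
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_with_cross_encoder(query, candidates, top_n=5):
    # Score every (query, chunk) pair jointly (the expensive step)
    pairs = [(query, chunk) for chunk in candidates]
    scores = reranker.predict(pairs)

    # Sort candidates by cross-encoder score, highest first
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]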

Practical Example with Cohere Rerank

import cohere

co = cohere.Client("YOUR_API_KEY")

results = co.rerank(
    query="How do I change my password?",
    documents=initial_results,  # 20 initial results
    top_n=5,                    # Return only the top 5
    model="rerank-multilingual-v3.0"
)

# Now results are ranked more accurately
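
# Each entry in `results.results` has an `index` pointing back into the
# original list and a `relevance_score` (this assumes the response shape
# of the current Cohere Python SDK)
top_docs = [initial_results[r.index] for r in results.results]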

Note: Apply Reranking on 20-50 initial results, not the entire database. If you rerank everything, you lose the speed advantage of ANN.

Impact of Reranking

Research shows Reranking typically improves retrieval accuracy by 5 to 15 percent — especially when queries are complex or ambiguous.

Query Expansion — Broaden the Question

Sometimes the user query is too short or vague. “Network problem” could mean a thousand things. Query Expansion means rewriting or expanding the query to get better search results.

Method 1 — Rewriting with LLM

rewrite_prompt = """
User question: {original_query}

Rewrite this question into 3 more specific and precise 
questions that preserve the original meaning.
"""

# Input: "network problem"
# Output:
# 1. "What causes WiFi network connection drops?"
# 2. "How to troubleshoot internet connection issues?"
# 3. "What network settings fix connection problems?"

Now you search for all three questions and merge the results. The chances of finding relevant chunks increase significantly.
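
As a concrete sketch of Method 1, here is what the rewriting step could look like with the OpenAI Python SDK; the model name and the line-by-line parsing are assumptions, and the same idea works with any chat-capable LLM. It is also one way to implement the expand_query helper used in the full pipeline at the end of this episode:

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def expand_query(original_query, n=3):
    prompt = (
        f"User question: {original_query}\n\n"
        f"Rewrite this question into {n} more specific and precise "
        "questions that preserve the original meaning. "
        "Return one question per line."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name; use whatever you have access to
        messages=[{"role": "user", "content": prompt}],
    )
    lines = response.choices[0].message.content.strip().splitlines()
    # Drop list numbering like "1." and skip empty lines
    rewrites = [line.strip().lstrip("0123456789.) ").strip() for line in lines if line.strip()]
    return rewrites[:n] or [original_query]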

Method 2 — Multi-Query Retrieval

Similar to the previous method, but more systematic:

def multi_query_retrieve(original_query, n_queries=3):
    # Step 1: Generate different questions
    queries = llm.generate_variations(original_query, n=n_queries)
    
    # Step 2: Search for each question
    all_results = []
    for q in queries:
        results = vector_search(q, top_k=10)
        all_results.extend(results)
    
    # Step 3: Deduplicate and rerank
    unique_results = deduplicate(all_results)
    return rerank(original_query, unique_results, top_n=5)
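
The deduplicate helper is worth spelling out, because the same chunk usually comes back for more than one query variant. A minimal sketch, assuming each result exposes the chunk text (deduplicating on a stable chunk id works just as well if your vector store returns one):

def deduplicate(results):
    seen = set()
    unique = []
    for r in results:
        key = r.text  # assumed attribute; use r.id if your store provides stable ids
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique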

HyDE — Hypothesize Before You Search

Hypothetical Document Embeddings (HyDE) is a creative idea. Instead of directly embedding and searching with the question, first ask the LLM to write a “hypothetical answer.” Then embed and search with that hypothetical answer.

Why does this work? Because the vector of an “answer” is closer to the vectors of “chunks containing the answer” than the vector of a “question.”

def hyde_retrieval(question):
    # Step 1: Generate a hypothetical answer (without Context)
    hypothetical_answer = llm.generate(f"""
    Write a hypothetical, complete answer to this question.
    It does not need to be accurate, just similar to a real answer.
    
    Question: {question}
    """)
    
    # Step 2: Embed the hypothetical answer
    hyde_vector = embed(hypothetical_answer)
    
    # Step 3: Search using the hypothetical answer vector
    results = vector_search(hyde_vector, top_k=10)
    
    return results

Example:

  • Question: “How do I upload a PDF?”
  • Hypothetical answer: “To upload a PDF file, go to the file management section, click the upload button, select your PDF file and…”
  • Now we search with this hypothetical answer vector — chunks that are actually about PDF uploading are more likely to be found

Note: HyDE is not always better. For simple, clear questions it might make no difference or even perform worse. For vague and complex questions, it is usually very effective.

Parent-Child Chunking

In the Chunking episode, we said chunks should not be too large (because Embedding accuracy drops) and should not be too small (because context is lost). Parent-Child Chunking is an elegant solution to this problem.

The idea: Split text into two levels:

  • Child Chunks: Small chunks (e.g., 200 tokens) — used for search
  • Parent Chunks: Larger chunks (e.g., 1000 tokens) — sent to the LLM as context

def parent_child_chunking(document):
    # Step 1: Large chunking (parent)
    parent_chunks = split(document, chunk_size=1000)
    
    # Step 2: Split each parent into children
    for parent in parent_chunks:
        # Keep the parent itself in a separate store, keyed by its id,
        # so fetch_parents() can look it up later
        store_parent(parent)
        children = split(parent, chunk_size=200)
        for child in children:
            # Store child with a reference back to its parent
            store(child, metadata={"parent_id": parent.id})
    
def retrieve(query):
    # Step 1: Search among children (small chunks)
    matching_children = vector_search(query, top_k=5)
    
    # Step 2: Return the corresponding parents
    parent_ids = set(c.metadata["parent_id"] for c in matching_children)
    parents = fetch_parents(parent_ids)
    
    return parents  # Larger chunks with more context

Why is it good? Searching over the small chunks is more precise, because each vector represents one focused piece of text. But the LLM receives the larger parent chunk, which carries more of the surrounding context.

It is like searching for a heading in a book index (precise search), but when you find it, you read the entire chapter (complete context).

Contextual Compression — Smart Compression

Sometimes a retrieved chunk is 500 words long but only 2 sentences are actually relevant. The rest is noise that wastes precious Context Window space.

Contextual Compression means after retrieval, you compress each chunk and keep only the relevant parts.

def contextual_compression(query, retrieved_chunks):
    compressed = []
    for chunk in retrieved_chunks:
        result = llm.generate(f"""
        Question: {query}
        Text: {chunk}
        
        Extract only the parts of the text that are directly 
        relevant to the question. If nothing is relevant, 
        write "irrelevant".
        """)
        
        if result.strip().lower() != "irrelevant":
            compressed.append(result)
    
    return compressed

Advantages:

  • Less Context Window consumption
  • LLM focuses on more relevant information
  • The “Lost in the Middle” problem is reduced

Disadvantages:

  • An extra LLM call is needed for each chunk
  • More latency and higher cost

Combining Techniques — The Advanced Pipeline

Now let us put all techniques together and build a complete pipeline:

def advanced_rag_pipeline(user_query):
    # Step 1: Query Expansion
    expanded_queries = expand_query(user_query, n=3)
    
    # Step 2: Multi-Query Retrieval
    all_chunks = []
    for q in expanded_queries:
        # Hybrid Search
        vector_results = vector_search(q, top_k=15)
        keyword_results = bm25_search(q, top_k=15)
        merged = rrf_merge(vector_results, keyword_results)
        all_chunks.extend(merged)
    
    # Step 3: Deduplication
    unique_chunks = deduplicate(all_chunks)
    
    # Step 4: Reranking
    reranked = cross_encoder_rerank(
        query=user_query,
        documents=unique_chunks,
        top_n=8
    )
    
    # Step 5: Contextual Compression
    compressed = contextual_compression(user_query, reranked)
    
    # Step 6: Generate answer
    answer = llm.generate(
        system_prompt=rag_system_prompt,
        context=compressed,
        question=user_query
    )
    
    return answer
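
The pipeline above leans on an rrf_merge helper for the Hybrid Search step. A common way to implement it is Reciprocal Rank Fusion: each chunk's fused score is the sum of 1 / (k + rank) over every result list it appears in, with k = 60 as the usual default. A minimal sketch, assuming each result carries a stable id:

def rrf_merge(*result_lists, k=60):
    # Reciprocal Rank Fusion: sum 1 / (k + rank) for each list a chunk appears in
    scores = {}
    chunks_by_id = {}
    for results in result_lists:
        for rank, chunk in enumerate(results, start=1):
            scores[chunk.id] = scores.get(chunk.id, 0.0) + 1.0 / (k + rank)
            chunks_by_id[chunk.id] = chunk

    # Return chunks ordered by fused score, best first
    ordered_ids = sorted(scores, key=scores.get, reverse=True)
    return [chunks_by_id[cid] for cid in ordered_ids]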

Warning: You do not need to use all techniques! Start with basic RAG, then based on evaluation results (previous episode), identify your weaknesses and add the appropriate technique.

When to Use Which Technique?

Reranking: Almost always useful. If you are only adding one technique, make it this one.

Query Expansion: When users ask short or vague questions.

HyDE: When questions are very different from the document text (e.g., colloquial question, formal documentation).

Parent-Child Chunking: When your documents are well-structured (like articles with headings and subheadings).

Contextual Compression: When Context Window is limited or your chunks are very long.

Summary

In this episode, you learned that:

  • Reranking with Cross-Encoder makes search results much more accurate
  • Query Expansion and Multi-Query Retrieval handle vague questions better
  • HyDE improves search by generating hypothetical answers
  • Parent-Child Chunking improves both search accuracy and context richness simultaneously
  • Contextual Compression removes noise and optimizes Context Window usage
  • You do not need to use every technique — choose based on your needs

Now you have a powerful RAG that searches well, answers well, and you know how to evaluate it. But one big question remains: how do you take this to production and actually run it on a server? The next and final episode in this series is about RAG in Production — from scalability challenges to security and cost.