Combining MCP with RAG

Episode 9 · 18 minutes

Quick Recap

Throughout this series, we’ve learned MCP from the ground up: core concepts, building servers, connecting to Claude, security, production deployment, and database integration. Now it’s time to take things further and combine MCP with RAG.

RAG, or Retrieval-Augmented Generation, is one of the most important techniques in the AI world. The idea is simple: instead of the AI model relying solely on its own memory, it first finds relevant information from an external source and then generates its answer.

When you combine MCP and RAG, something remarkable happens: AI that can connect to vector databases through standard MCP tools, perform semantic search, and deliver far more accurate answers.

Who is this for?
If you have a basic understanding of RAG, great. If not, don’t worry — I’ll explain RAG first before diving into how it integrates with MCP.

RAG in Plain Language

Imagine you ask Claude: “What’s our company’s leave policy?” Claude doesn’t know — because that information wasn’t in its training data. This is where RAG comes in.

With RAG, before Claude answers, the system searches through the company’s documentation, finds the relevant sections, and feeds them to Claude. Now Claude has the actual information and can give a precise, documentation-backed answer.

Analogy
RAG is like the difference between a closed-book and open-book exam. In a closed-book exam, you rely solely on your memory (you might forget or get things wrong). In an open-book exam, you can look up answers and give more accurate responses. RAG gives AI an “open book.”

The Components of RAG

A RAG system has three main components:

1. Embedding: Each piece of text is converted into a numerical vector. This vector represents the “meaning” of the text. Texts with similar meanings have vectors that are close together.

2. Vector Database: Where these vectors are stored and where you can perform “similarity” searches. When you ask a question, the question is also converted into a vector, and the nearest vectors (= most relevant content) are returned.

3. Generation: The AI model receives the found passages as context and generates its answer based on them.

Tip
Think of embedding as compressing the meaning of text into a series of numbers. “A dog is a popular pet” and “man’s best friend” would have similar vectors, even though they have no words in common.
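
To make “vectors that are close together” concrete, here is a minimal sketch of cosine similarity, a common way vector databases compare embeddings. The three-dimensional vectors are made up purely for illustration; real embeddings have hundreds of dimensions and come from your embedding model.

// Cosine similarity: close to 1 = very similar meaning, close to 0 = unrelated
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Made-up example vectors; real ones come from an embedding model
const dogIsAPet = [0.81, 0.12, 0.55];      // "A dog is a popular pet"
const mansBestFriend = [0.78, 0.15, 0.60]; // "man's best friend"
console.log(cosineSimilarity(dogIsAPet, mansBestFriend)); // high score → similar meaning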

Why MCP + RAG?

You might ask: “Well, RAG works without MCP. Why combine them?” Good question. Here’s the answer:

Without MCP: RAG is typically hardcoded into the application. The developer must specify when to search, in which source, and how results get added to the prompt. Every change means changing code.

With MCP: RAG search becomes a tool. The AI decides when to use it. It can search multiple times, combine results, and even refine its query if the initial results aren’t good enough.

Most importantly: when RAG is an MCP tool, any MCP client (Claude Desktop, Claude Code, any other app) can use it. You no longer need to implement RAG separately for each application.

Comparison
Traditional RAG: Search → results → prompt → answer (linear and fixed)
RAG + MCP: AI decides → search → analyze → maybe search again → final answer (dynamic and intelligent)

The Combined Architecture

Let’s walk through the architecture step by step.

Stage 1: Data Preparation (Ingestion)

First, you need to prepare your documents:

  1. Chunking: Split long documents into smaller pieces (e.g., 500 words each). Each chunk should cover a complete concept.
  2. Embedding: Convert each chunk into a vector using an embedding model (like OpenAI’s text-embedding-3-small or free models like all-MiniLM-L6-v2).
  3. Storage: Store the vectors in a vector database.

This stage happens once (and is repeated whenever documents are updated).
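
To make the pipeline concrete, here is a rough ingestion sketch. chunkText, embed, and vectorStore are hypothetical placeholders for your chunking logic, embedding model, and vector database client; they are not part of any specific library.

// Hypothetical helpers: wire these to your own chunker, embedding model, and vector DB
declare function chunkText(text: string, opts: { size: number; overlap: number }): string[];
declare function embed(text: string): Promise<number[]>;
declare const vectorStore: {
  upsert(item: { id: string; vector: number[]; metadata: Record<string, unknown> }): Promise<void>;
};

// Ingestion: chunk → embed → store, run once and re-run when documents change
async function ingestDocument(doc: { id: string; title: string; content: string }) {
  const chunks = chunkText(doc.content, { size: 500, overlap: 75 });

  for (let i = 0; i < chunks.length; i++) {
    const vector = await embed(chunks[i]); // one vector per chunk

    await vectorStore.upsert({
      id: `${doc.id}-chunk-${i}`,
      vector,
      metadata: { title: doc.title, chunk_index: i, text: chunks[i] }
    });
  }
}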

Stage 2: The MCP Server

Now you build an MCP Server that exposes search tools. A user (or AI) asks a question, the server converts it to a vector, searches the vector database, and returns the most relevant results.

Stage 3: AI Intelligence

The AI reads the search results and answers based on them. But here’s the interesting part — the AI can:

  • search again with a different query if the initial results aren’t sufficient
  • search across multiple sources and combine the results
  • apply filters, such as requesting only documents from 2026
  • evaluate the results and say “I’m not sure” if no relevant documents were found

MCP Tools for RAG

Now let’s get practical. An MCP Server for RAG typically provides these tools:

The search Tool

The primary tool. It takes a text query and returns the most relevant chunks.

// Tool: search
{
  name: "search_documents",
  description: "Search the knowledge base using semantic search",
  inputSchema: {
    type: "object",
    properties: {
      query: { 
        type: "string", 
        description: "Natural language search query" 
      },
      top_k: { 
        type: "number", 
        description: "Number of results (default 5)",
        default: 5
      },
      filter: {
        type: "object",
        description: "Optional metadata filters",
        properties: {
          category: { type: "string" },
          date_after: { type: "string" }
        }
      }
    },
    required: ["query"]
  }
}
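
A handler for this tool might look roughly like the sketch below. embedQuery and vectorDb are hypothetical stand-ins for your embedding model and vector database client; the exact calls depend on which database you choose (more on the options shortly).

// Hypothetical handler sketch for search_documents
handler: async ({ query, top_k = 5, filter }) => {
  const queryVector = await embedQuery(query); // same embedding model used at ingestion time

  const matches = await vectorDb.search({
    vector: queryVector,
    limit: top_k,
    filter // optional metadata filter (category, date_after)
  });

  // Return the chunk text plus metadata so the AI can cite its sources
  return matches.map(m => ({
    text: m.metadata.text,
    source: m.metadata.title,
    score: m.score
  }));
}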

The list_collections Tool

Similar to list_tables in the previous episode. It tells the AI what collections exist in the vector database. For example, “technical documentation,” “company policies,” “product guides.”

// Tool: list_collections
{
  name: "list_collections",
  description: "List available document collections",
  handler: async () => {
    return [
      { name: "technical_docs", description: "Technical documentation", doc_count: 1250 },
      { name: "company_policies", description: "HR and company policies", doc_count: 89 },
      { name: "product_guides", description: "Product user guides", doc_count: 340 }
    ];
  }
}

The get_document Tool

When the AI finds a relevant chunk, it might want to see the full original document (not just that chunk). This tool returns a complete document by ID.

// Tool: get_document
{
  name: "get_document",
  description: "Get the full content of a document by ID",
  inputSchema: {
    type: "object",
    properties: {
      document_id: { type: "string" }
    },
    required: ["document_id"]
  }
}
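
Following the same pattern, here is a handler sketch; documentStore is a hypothetical stand-in for wherever the full (un-chunked) documents live:

// Hypothetical handler sketch for get_document
handler: async ({ document_id }) => {
  const doc = await documentStore.get(document_id);
  if (!doc) {
    return { error: `Document ${document_id} not found` };
  }
  return { id: doc.id, title: doc.title, content: doc.content, updated_at: doc.updated_at };
}
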
Why three tools?
This pattern mirrors the database episode: discover (list_collections) → search (search_documents) → detail (get_document). With these three tools, AI can decide for itself how to reach an answer.

Vector Databases

For storing and searching vectors, you have several popular options:

Pinecone: Cloud service. Easiest to start with. Handles hosting and scaling. But it’s paid, and your data lives on someone else’s servers.

Weaviate: Open-source and self-hosted. Available both as a cloud service and for local deployment. Good filtering and metadata capabilities.

ChromaDB: Lightweight and simple. Great for prototyping and small projects. Extremely easy to use with Python.

Qdrant: Fast and optimized. Written in Rust. Available as both cloud and self-hosted. Clean, straightforward API.

pgvector: If you’re already using PostgreSQL, pgvector is an extension that adds vector search capabilities to your existing database. No separate service needed.

Recommendation
If you already have PostgreSQL, start with pgvector — no new service required. For larger projects, evaluate Qdrant or Weaviate. For prototyping, ChromaDB is the fastest path.
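
If you do start with pgvector, the search side of the server can be a plain SQL query against your existing database. Here is a sketch using the node-postgres (pg) client; the chunks table layout and the 384-dimension embedding column are assumptions for illustration, and the query vector comes from whatever embedding step you use:

import { Pool } from "pg";

const pool = new Pool(); // connection settings come from the usual PG* environment variables

// Assumed table: chunks(id, title, content, embedding vector(384))
async function searchChunks(queryVector: number[], topK = 5) {
  const result = await pool.query(
    `SELECT id, title, content, embedding <=> $1::vector AS distance
     FROM chunks
     ORDER BY embedding <=> $1::vector
     LIMIT $2`,
    [`[${queryVector.join(",")}]`, topK] // pgvector accepts a text literal like [0.1,0.2,...]
  );
  return result.rows; // smallest cosine distance = most relevant
}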

Practical Example: Company Knowledge Base

Let me walk through a real-world scenario. Imagine your company has 500 pages of internal documentation — from product guides to HR policies. Employees ask questions like these every day:

  • “What’s the process for requesting time off?”
  • “What are the API rate limits for version 3?”
  • “How do I report a critical bug?”

Without RAG, they either search manually (time-consuming) or ask colleagues (takes two people’s time). With MCP + RAG:

An employee types in Claude: “What’s the process for requesting time off?”

Claude calls search_documents with the query “process for requesting time off.” The vector database returns the most relevant chunks. Claude answers based on the actual company documentation — citing its sources.

If the answer isn’t sufficient, Claude can search again with a different query or use get_document to see the full document.
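
The exchange might look something like this; the content is invented purely for illustration:

// Claude decides to call the tool
search_documents({ query: "process for requesting time off", top_k: 5 })

// The server returns the most relevant chunks, for example:
[
  {
    text: "Time-off requests are submitted through the HR portal at least 14 days in advance...",
    source: "Employee Handbook / Leave Policy",
    score: 0.87
  }
  // ...more chunks, ranked by similarity
]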

Chunking: The Art of Splitting

One of the most important parts of RAG that many people underestimate is chunking — how you divide documents into pieces.

Too large: If chunks are too big (e.g., an entire page), search accuracy drops because irrelevant sections get included.

Too small: If they’re too small (e.g., a single sentence), context is lost and the AI can’t grasp the full concept.

Optimal: Typically 300-500 words with 50-100 words of overlap between chunks. Overlap ensures that content crossing chunk boundaries doesn’t get lost.

Analogy
Chunking is like slicing pizza. If you cut slices too large, they’re hard to handle. If you cut them too small, you just have meaningless crumbs. The right size means each piece is a complete, meaningful bite.

Chunking Strategies

Fixed-size chunking: Every 500 words becomes a chunk. Simple, but might cut in the middle of a sentence.

Semantic chunking: Splits based on document structure — each section (heading) becomes a chunk. More precise but more complex.

Recursive chunking: First tries to split by headings. If a chunk is still too big, splits by paragraphs. If still too big, by sentences. The best general-purpose approach.
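
Here is a minimal sketch of fixed-size chunking with overlap, using the sizes suggested above. Splitting on whitespace is a simplification; real chunkers usually work on tokens, sentences, or document structure:

// Fixed-size chunking with overlap (word-based, simplified)
function chunkText(text: string, chunkSize = 400, overlap = 75): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];

  for (let start = 0; start < words.length; start += chunkSize - overlap) {
    chunks.push(words.slice(start, start + chunkSize).join(" "));
    if (start + chunkSize >= words.length) break; // last chunk reached
  }
  return chunks;
}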

Improving Search Quality

Several techniques can boost the quality of RAG results:

Hybrid Search: Combining semantic search (vector) with keyword search. Sometimes users are looking for an exact term (like a product model number) that semantic search won’t find.

Reranking: After the initial search, re-sort results with a separate model (reranker). Rerankers are more accurate than simple vector search but slower — so they’re only run on the top 20-50 results.

Query Expansion: Transform the original user question into multiple queries and search all of them. For example, expand “leave policy” to also include “time off,” “vacation policy,” “PTO guidelines.”

Metadata Filtering: Use metadata for more precise filtering. For example, only HR documents, or only documents updated in the last 6 months.
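
As one example, here is a query expansion sketch layered on top of the search tool from earlier. The searchDocuments wrapper is hypothetical, and in practice the expanded phrasings might come from a synonym list or from asking the model itself to rephrase the question:

// Hypothetical wrapper around the search_documents tool
declare function searchDocuments(
  query: string,
  topK: number
): Promise<{ text: string; source: string; score: number }[]>;

// Search several phrasings, keep the best score per chunk, return the overall top results
async function expandedSearch(query: string, expansions: string[], topK = 5) {
  const best = new Map<string, { text: string; source: string; score: number }>();

  for (const q of [query, ...expansions]) {
    for (const r of await searchDocuments(q, topK)) {
      const key = r.source + r.text;
      const existing = best.get(key);
      if (!existing || r.score > existing.score) best.set(key, r);
    }
  }

  return [...best.values()].sort((a, b) => b.score - a.score).slice(0, topK);
}

// e.g. expandedSearch("leave policy", ["time off", "vacation policy", "PTO guidelines"])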

MCP and Embedding as a Tool

Here’s an interesting point: the embedding operation itself can be an MCP tool. Imagine the AI wants to add a new text to the vector database (for instance, when a new document is created).

// Tool: add_document
{
  name: "add_document",
  description: "Add a new document to the knowledge base",
  inputSchema: {
    type: "object",
    properties: {
      title: { type: "string" },
      content: { type: "string" },
      collection: { type: "string" },
      metadata: { type: "object" }
    },
    required: ["title", "content", "collection"]
  }
}

Of course, this tool should be offered with caution. Adding incorrect content to the knowledge base can degrade the quality of the entire system. Always include a human approval layer.

Warning
Write tools (add/edit/delete) on the knowledge base are highly sensitive. If AI adds incorrect data, all subsequent queries get incorrect answers too. Human approval for write operations is mandatory.
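
One way to enforce that approval layer is to have the tool stage submissions instead of writing them straight into the vector database. Here is a sketch, where pendingQueue and notifyReviewer are hypothetical pieces of your own infrastructure:

// Hypothetical add_document handler that stages content for human review
handler: async ({ title, content, collection, metadata }) => {
  const submission = {
    id: crypto.randomUUID(), // global in Node 19+; otherwise import from node:crypto
    title,
    content,
    collection,
    metadata,
    status: "pending_review",
    submitted_at: new Date().toISOString()
  };

  await pendingQueue.push(submission); // nothing is indexed or searchable yet
  await notifyReviewer(submission.id); // a human approves before ingestion runs

  return { status: "pending_review", submission_id: submission.id };
}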

Challenges and Solutions

Combining MCP and RAG comes with its own unique challenges:

Challenge 1: Hallucination. Even with RAG, AI might “hallucinate” and say something not in the documents. Solution: instruct the AI to always cite its source. If it can’t find a source, it should say “I couldn’t find information on this topic.”

Challenge 2: Stale documents. If documents aren’t updated, AI gives outdated answers. Solution: add a last-updated date to metadata and configure the AI to warn when a document is old.

Challenge 3: Multiple languages. If documents are in multiple languages, the embedding model must be multilingual. Models like multilingual-e5-large are designed for this.

Challenge 4: Scale. As the document count grows (say, 1 million chunks), search slows down. Optimized vector databases like Qdrant are built for this scale, using approximate nearest-neighbor indexes to keep search fast.

Challenge 5: Embedding cost. Converting each chunk to a vector costs money (if you use an API). For large document sets, use local models like all-MiniLM-L6-v2 which are free.
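
For example, Transformers.js can run all-MiniLM-L6-v2 locally in Node. A sketch follows; the model is downloaded and cached on first use, and the exact package details may differ depending on your version:

import { pipeline } from "@xenova/transformers";

// Loads the model locally on first call (no API key, no per-request cost)
const extractor = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2");

async function embed(text: string): Promise<number[]> {
  const output = await extractor(text, { pooling: "mean", normalize: true });
  return Array.from(output.data); // a 384-dimensional vector
}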

An Advanced Architecture: Multi-Source RAG

The real power of MCP + RAG becomes clear when you have multiple sources:

{
  "mcpServers": {
    "internal-docs": {
      "command": "node",
      "args": ["servers/rag-server.js", "--collection=internal"]
    },
    "product-docs": {
      "command": "node",
      "args": ["servers/rag-server.js", "--collection=product"]
    },
    "database": {
      "command": "node",
      "args": ["servers/db-server.js"]
    }
  }
}

Now the AI has access to internal documentation, product documentation, and the database simultaneously. It can answer a question from multiple angles:

“Why is customer #452 unhappy with the product?”

The AI reads the customer’s order history from the database, finds known limitations from the product documentation, and searches for related reported bugs in internal docs. The result: a comprehensive analysis that turns hours of manual research into seconds.

Wrapping Up

Combining MCP and RAG is one of the most exciting applications of AI today. Here’s the summary:

  • RAG = search + retrieve + generate — AI finds information before answering
  • MCP = standard interface — search becomes a tool the AI decides when to use
  • Vector Database = semantic memory — texts are searched by meaning, not just keywords
  • Good chunking = the key to quality — how you split documents matters enormously
  • Multi-source = the real power — AI combines information from multiple sources

With MCP, RAG is no longer a hardcoded feature — it’s a dynamic, reusable tool that any AI client can leverage.

Next Steps
If you want to learn more about RAG, I recommend starting with ChromaDB + a free embedding model. Index a small collection of documents and build an MCP Server that queries it. The best way to learn is by building!