RAG stands for Retrieval-Augmented Generation: text generation augmented by information retrieval. Before answering, the system first finds relevant information, then generates an answer based on it.
The Doctor and Medical Records Analogy
Imagine visiting a specialist doctor: years of education, thousands of papers studied. But the first thing they do? Open your medical file. Without your specific history, allergies, and test results, even the best doctor can only guess. The LLM is that doctor: plenty of general knowledge, but no patient file. RAG is the system that finds the file and puts it in front of the doctor.
Three Stages of RAG
Stage 1: Retrieve
When a user asks a question, search your information sources for the most relevant chunks. You do not want all the information, just the most relevant pieces.
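A minimal sketch of the retrieve stage. The bag-of-words "embedding" and cosine similarity here are toy stand-ins for illustration; a real system would use an embedding model and a vector index:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; a real system calls an embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question, chunks, top_k=2):
    """Return the top_k chunks most similar to the question."""
    q = embed(question)
    scored = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return scored[:top_k]

chunks = [
    "Refunds are processed within 14 days of the return request.",
    "Our office is open Monday through Friday, 9am to 5pm.",
    "Shipping to Europe takes 3 to 5 business days.",
]
print(retrieve("How long do refunds take?", chunks, top_k=1))
```

Note that only the best-matching chunk comes back, not the whole knowledge base.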
Stage 2: Augment
Combine the retrieved information with the user’s question to create an enriched prompt with all necessary context.
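The augment stage is mostly string assembly. A minimal sketch (the prompt wording is an illustrative assumption, not a fixed template):

```python
def augment(question, retrieved_chunks):
    """Combine retrieved context with the user's question into one prompt."""
    context = "\n".join(f"- {c}" for c in retrieved_chunks)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = augment(
    "How long do refunds take?",
    ["Refunds are processed within 14 days of the return request."],
)
print(prompt)
```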
Stage 3: Generate
Give the enriched prompt to the LLM. The model now has both its general knowledge and specific, up-to-date information to produce an accurate, natural, and trustworthy answer.
RAG vs Fine-tuning
- Fine-tuning: Changes the model’s behavior and style. Requires an expensive training run, and updating knowledge means retraining.
- RAG: Supplies the model with information at query time. Cheap, fast, always up-to-date.
- Best approach: Use both together when needed.
Two Phases of RAG
Ingestion (offline): Read documents → chunk them → embed each chunk → store in Vector Database.
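The ingestion phase can be sketched end to end. The character-based chunker, toy embedding, and in-memory list standing in for a vector database are all simplifying assumptions; real pipelines chunk by tokens or sections, call an embedding model, and write to a proper vector store:

```python
from collections import Counter

def chunk(document, size=40):
    """Split a document into roughly size-character, word-aligned chunks."""
    words, chunks, current = document.split(), [], ""
    for w in words:
        if current and len(current) + 1 + len(w) > size:
            chunks.append(current)
            current = w
        else:
            current = f"{current} {w}".strip()
    if current:
        chunks.append(current)
    return chunks

def embed(text):
    """Toy embedding; a real pipeline calls an embedding model here."""
    return Counter(text.lower().split())

# The "vector database": a list of (vector, original_text) pairs.
vector_db = []
document = ("Refunds are processed within 14 days. "
            "Shipping to Europe takes 3 to 5 business days.")
for c in chunk(document):
    vector_db.append((embed(c), c))
print(len(vector_db))
```

This runs offline, once per document, before any user question arrives.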
Query (online): Embed the question → find similar vectors → get original text → combine with question → send to LLM → return answer.
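The online query phase chains the three stages together. This sketch reuses toy embeddings and passes the LLM in as a plain function; `fake_llm` is a hypothetical stand-in for a real model call:

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding; stands in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def answer(question, vector_db, llm):
    # 1. Embed the question and find the most similar stored vector.
    q = embed(question)
    _, best_text = max(vector_db, key=lambda pair: cosine(q, pair[0]))
    # 2. Combine the retrieved chunk's original text with the question.
    prompt = f"Context: {best_text}\nQuestion: {question}\nAnswer:"
    # 3. Send the enriched prompt to the LLM and return its answer.
    return llm(prompt)

chunks = [
    "Refunds are processed within 14 days of the return request.",
    "Shipping to Europe takes 3 to 5 business days.",
]
vector_db = [(embed(c), c) for c in chunks]
fake_llm = lambda prompt: f"(model reply based on) {prompt.splitlines()[0]}"
print(answer("How long do refunds take?", vector_db, fake_llm))
```

Swapping `fake_llm` for a real model call is the only change needed to make this a working pipeline, under the stated toy-embedding assumption.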
Summary
- RAG = Retrieval + Augmentation + Generation
- Like a doctor who opens the patient file before diagnosis
- Two phases: Ingestion (offline) and Query (online)
- Cheaper, faster, and more flexible than fine-tuning for information delivery