We have learned about open-source models — Llama, Qwen, DeepSeek — and how to choose among them. But one big problem remains unsolved: these models only know what was in their training data. They are unaware of new information, your company data, or the PDF you received yesterday. Today we solve this with RAG (Retrieval-Augmented Generation).
Four Big Limitations of an LLM Alone
- Outdated information — Every model has a knowledge cutoff date
- Hallucination — Confidently producing wrong answers
- No access to proprietary data — Internal documents, databases, emails are invisible
- Memory limitations — Context window caps how much text can be processed at once
RAG — The Doctor Analogy
A brilliant doctor with years of education still opens your medical file before making a diagnosis. RAG applies the same idea: before the model answers, retrieve the relevant documents from the archive and put them in front of it.
Basic RAG Architecture
Stage 1: Document Preparation (Indexing)
- Collect documents (PDF, Word, web pages, databases)
- Convert to plain text
- Chunking: split into smaller pieces
- Embedding: convert each chunk to a numerical vector
- Store in a vector database (a minimal indexing sketch follows this list)
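A minimal indexing sketch using ChromaDB (one of the vector databases listed under Key Components below). The storage path, collection name, and sample chunks are placeholders, and ChromaDB's built-in default embedding model stands in for a dedicated embedding model:

```python
# Index a handful of chunks into a local, on-disk ChromaDB collection.
import chromadb

client = chromadb.PersistentClient(path="./rag_index")        # on-disk storage (path is a placeholder)
collection = client.get_or_create_collection("company_docs")  # collection name is a placeholder

# In practice these come from the chunking step; here they are toy examples.
chunks = [
    "Refund requests must be filed within 30 days of purchase.",
    "Support is available Monday through Friday, 9:00-17:00.",
]

# ChromaDB embeds each chunk with its default embedding model and stores the vectors.
collection.add(
    documents=chunks,
    ids=[f"chunk-{i}" for i in range(len(chunks))],
)
```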
Stage 2: Search (Retrieval)
- Embed the user’s question
- Compare question vector with stored vectors
- Find the most relevant chunks (see the query sketch below)
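Continuing with the collection above, the query step embeds the question with the same model and returns the closest stored chunks; the question text and the number of results are illustrative:

```python
# Embed the question and retrieve the top matching chunks.
results = collection.query(
    query_texts=["What is the refund window?"],  # placeholder question
    n_results=3,                                 # top-3 most similar chunks
)
retrieved_chunks = results["documents"][0]       # chunk texts for the first query
```

How many chunks to retrieve is a tuning decision: too few can miss the answer, too many dilutes the context passed to the model.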
Stage 3: Answer Generation
Relevant chunks + user question go to the LLM, which generates the final answer.
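A sketch of this step, continuing from the retrieval sketch above. It assumes an OpenAI-compatible chat endpoint such as a local Ollama or vLLM server; the base URL and model name are placeholders:

```python
# Build a prompt from the retrieved chunks and send it to the LLM.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")  # placeholder endpoint

question = "What is the refund window?"
context = "\n\n".join(retrieved_chunks)  # chunks from the retrieval step above
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}"
)

response = client.chat.completions.create(
    model="llama3.1",  # placeholder; any chat model served by the endpoint works
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```

Instructing the model to answer only from the provided context is a simple way to reduce hallucination, the second limitation listed above.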
Key Components
- Embedding models: BGE-M3, Multilingual-E5 (good for non-English)
- Vector databases: ChromaDB (simple start), Qdrant (production), Weaviate, Pinecone, pgvector
- Chunking: recursive character splitting is recommended as a starting point (see the sketch after this list)
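As an illustration of that starting point, LangChain's text-splitters package provides a recursive character splitter; the chunk size and overlap below are illustrative values, not tuned recommendations:

```python
# Recursively split a document into overlapping character chunks.
from langchain_text_splitters import RecursiveCharacterTextSplitter

long_document_text = "..."  # plain text extracted from a PDF, web page, etc.

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # target characters per chunk (illustrative)
    chunk_overlap=50,  # overlap keeps context across chunk boundaries
)
chunks = splitter.split_text(long_document_text)
```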
RAG vs Fine-tuning
- RAG: When data changes frequently. Just update the index.
- Fine-tuning: When you want to change model behavior/style.
- Both together: Best results — fine-tune for style, RAG for knowledge.