We have learned about open-source models — Llama, Qwen, DeepSeek — and how to choose among them. But one big problem remains unsolved: these models only know what was in their training data. They are unaware of new information, your company data, or the PDF you received yesterday. Today we solve this with RAG (Retrieval-Augmented Generation).
Four Big Limitations of an LLM Alone
- Outdated information — Every model has a knowledge cutoff date
- Hallucination — Confidently producing wrong answers
- No access to proprietary data — Internal documents, databases, emails are invisible
- Memory limitations — Context window caps how much text can be processed at once
RAG — The Doctor Analogy
A brilliant doctor with years of education still opens your medical file before making a diagnosis. RAG applies the same idea: before the model answers, retrieve the relevant documents from the archive and put them in front of it.
Basic RAG Architecture
Stage 1: Document Preparation (Indexing)
- Collect documents (PDF, Word, web pages, databases)
- Convert to plain text
- Chunking: split into smaller pieces
- Embedding: convert each chunk to a numerical vector
- Store in a vector database (a minimal indexing sketch follows this list)
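A minimal indexing sketch using ChromaDB (one of the vector databases listed under Key Components below). The storage path, collection name, and sample chunks are placeholders, and ChromaDB's built-in default embedding model stands in for a dedicated embedding model:

```python
# Index a handful of chunks into a local, on-disk ChromaDB collection.
import chromadb

client = chromadb.PersistentClient(path="./rag_index")        # on-disk storage (path is a placeholder)
collection = client.get_or_create_collection("company_docs")  # collection name is a placeholder

# In practice these come from the chunking step; here they are toy examples.
chunks = [
    "Refund requests must be filed within 30 days of purchase.",
    "Support is available Monday through Friday, 9:00-17:00.",
]

# ChromaDB embeds each chunk with its default embedding model and stores the vectors.
collection.add(
    documents=chunks,
    ids=[f"chunk-{i}" for i in range(len(chunks))],
)
```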
Stage 2: Search (Retrieval)
- Embed the user’s question
- Compare question vector with stored vectors
- Find the most relevant chunks (see the query sketch below)
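Continuing with the collection above, the query step embeds the question with the same model and returns the closest stored chunks; the question text and the number of results are illustrative:

```python
# Embed the question and retrieve the top matching chunks.
results = collection.query(
    query_texts=["What is the refund window?"],  # placeholder question
    n_results=3,                                 # top-3 most similar chunks
)
retrieved_chunks = results["documents"][0]       # chunk texts for the first query
```

How many chunks to retrieve is a tuning decision: too few can miss the answer, too many dilutes the context passed to the model.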
Stage 3: Answer Generation
Relevant chunks + user question go to the LLM, which generates the final answer.
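A sketch of this step, continuing from the retrieval sketch above. It assumes an OpenAI-compatible chat endpoint such as a local Ollama or vLLM server; the base URL and model name are placeholders:

```python
# Build a prompt from the retrieved chunks and send it to the LLM.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")  # placeholder endpoint

question = "What is the refund window?"
context = "\n\n".join(retrieved_chunks)  # chunks from the retrieval step above
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}"
)

response = client.chat.completions.create(
    model="llama3.1",  # placeholder; any chat model served by the endpoint works
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```

Instructing the model to answer only from the provided context is a simple way to reduce hallucination, the second limitation listed above.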
Key Components
- Embedding models: BGE-M3, Multilingual-E5 (good for non-English)
- Vector databases: ChromaDB (simple start), Qdrant (production), Weaviate, Pinecone, pgvector
- Chunking: recursive character splitting is recommended as a starting point (see the sketch after this list)
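As an illustration of that starting point, LangChain's text-splitters package provides a recursive character splitter; the chunk size and overlap below are illustrative values, not tuned recommendations:

```python
# Recursively split a document into overlapping character chunks.
from langchain_text_splitters import RecursiveCharacterTextSplitter

long_document_text = "..."  # plain text extracted from a PDF, web page, etc.

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # target characters per chunk (illustrative)
    chunk_overlap=50,  # overlap keeps context across chunk boundaries
)
chunks = splitter.split_text(long_document_text)
```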
RAG vs Fine-tuning
- RAG: When data changes frequently. Just update the index.
- Fine-tuning: When you want to change model behavior/style.
- Both together: Best results — fine-tune for style, RAG for knowledge.