Embedding — When Text Becomes Numbers

Episode 3 · 20 min

In the previous episode we saw that RAG has three stages: retrieval, augmentation, and generation. But an important question remained: how does the system know which text chunks are relevant to the user’s question? The answer is one word: Embedding.

The Problem: Computers Do Not Understand Language

Computers only work with numbers. The first attempts to represent words were simple: assign each word a number (dog=1, cat=2, car=3). But this gives no information about meaning or relationships between words.
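The weakness of that naive scheme is easy to demonstrate. A minimal sketch with a hypothetical toy vocabulary:

```python
# Assign each word an arbitrary integer ID (a hypothetical toy vocabulary).
vocab = {"dog": 1, "cat": 2, "car": 3}

# The numeric distance between IDs says nothing about meaning:
# "dog" and "cat" (related) differ by 1, but so do "cat" and "car" (unrelated).
print(abs(vocab["dog"] - vocab["cat"]))  # 1
print(abs(vocab["cat"] - vocab["car"]))  # 1
```

Any pairing of words to numbers would behave the same way: the IDs encode identity, not meaning.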

The City of Meaning Analogy

Think of an imaginary city where related words live close together and unrelated words live far apart. Each word has a precise address (numerical coordinates). “Dog” and “cat” are in the same neighborhood (pets), “Paris” and “London” are in the European cities neighborhood, “happy” and “joyful” are practically roommates.

Vector Algebra — The Magic of Embeddings

king - man + woman ≈ queen
Paris - France + Italy ≈ Rome

Relationships like these suggest embeddings capture meaning, not just surface characters.
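The king − man + woman ≈ queen arithmetic can be sketched with hand-made 2-dimensional vectors. Real embeddings have hundreds of learned dimensions; these toy coordinates (think "royalty" and "gender" axes) are an assumption made purely for intuition:

```python
# Toy 2-D vectors chosen by hand so the analogy works exactly.
# Dimension 0 ≈ "royalty", dimension 1 ≈ "gender" — illustrative only.
vectors = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
}

# Component-wise: king - man + woman
result = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]

print(result)                      # [1.0, -1.0]
print(result == vectors["queen"])  # True
```

In real models the result only lands *near* the queen vector, which is why the relation is written with ≈ rather than =.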

Cosine Similarity

The most common way to measure how similar two texts are is to measure the angle between their embedding vectors. Same direction (0°) = similarity of 1. Perpendicular (90°) = similarity of 0. Opposite directions (180°) = similarity of −1.
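The formula is the dot product of the two vectors divided by the product of their lengths. A minimal sketch using only the standard library:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b.

    1.0 = same direction (0°), 0.0 = perpendicular (90°),
    -1.0 = opposite directions (180°).
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 0], [1, 0]))  # 1.0  (same direction)
print(cosine_similarity([1, 0], [0, 1]))  # 0.0  (perpendicular)
```

Because only the angle matters, a long document and a short query can still score as highly similar if their vectors point the same way.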

Popular Embedding Models

  • OpenAI text-embedding-3-small/large — High quality, API-based
  • Cohere embed-v3 — Excellent multilingual support
  • BGE, E5, all-MiniLM-L6-v2 — Open-source options

Practical Tips

  1. Use the same embedding model for both queries and documents
  2. Normalize vectors before computing similarity
  3. Use batch processing for efficiency
  4. Cache embeddings — generate once, reuse many times
  5. For non-English text, use multilingual models
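Tips 2 and 4 can be combined into a small helper. This is a sketch under assumptions: `embed_fn` stands in for whichever embedding API you actually call, and the in-memory dictionary cache is illustrative, not a specific library's feature:

```python
import math

def normalize(v):
    """Scale a vector to unit length, so a plain dot product
    between two normalized vectors equals their cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

_cache = {}

def embed_cached(text, embed_fn):
    """Embed each distinct text once (tip 4), storing the
    normalized vector (tip 2); later calls hit the cache."""
    if text not in _cache:
        _cache[text] = normalize(embed_fn(text))
    return _cache[text]
```

In production the dictionary would typically be replaced by a persistent store keyed by a hash of the text, but the principle is the same: pay for each embedding only once.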

Summary

  • Embedding converts text to numerical vectors preserving meaning
  • Like giving addresses to words in a city of meaning
  • Math operations on vectors are semantically meaningful
  • Cosine Similarity measures how similar two texts are
  • Various models available: OpenAI, Cohere, and open-source
  • For non-English languages, multilingual models work better