Embedding — When Text Becomes Numbers

Episode 3 · 20 min

In the previous episode we saw that RAG has three stages: retrieval, augmentation, and generation. But an important question remained: how does the system know which text chunks are relevant to the user’s question? The answer is one word: Embedding.

The Problem: Computers Do Not Understand Language

Computers only work with numbers. The first attempts to represent words were simple: assign each word a number (dog=1, cat=2, car=3). But this gives no information about meaning or relationships between words.
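The weakness of that naive scheme is easy to demonstrate. A minimal sketch with a hypothetical toy vocabulary:

```python
# Assign each word an arbitrary integer ID (a hypothetical toy vocabulary).
vocab = {"dog": 1, "cat": 2, "car": 3}

# The numeric distance between IDs says nothing about meaning:
# "dog" and "cat" (related) differ by 1, but so do "cat" and "car" (unrelated).
print(abs(vocab["dog"] - vocab["cat"]))  # 1
print(abs(vocab["cat"] - vocab["car"]))  # 1
```

Any pairing of words to numbers would behave the same way: the IDs encode identity, not meaning.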

The City of Meaning Analogy

Think of an imaginary city where related words live close together and unrelated words live far apart. Each word has a precise address (numerical coordinates). “Dog” and “cat” are in the same neighborhood (pets), “Paris” and “London” are in the European cities neighborhood, “happy” and “joyful” are practically roommates.

Vector Algebra — The Magic of Embeddings

king - man + woman ≈ queen
Paris - France + Italy ≈ Rome

Relationships like these suggest embeddings capture meaning, not just surface characters.
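The king − man + woman ≈ queen arithmetic can be sketched with hand-made 2-dimensional vectors. Real embeddings have hundreds of learned dimensions; these toy coordinates (think "royalty" and "gender" axes) are an assumption made purely for intuition:

```python
# Toy 2-D vectors chosen by hand so the analogy works exactly.
# Dimension 0 ≈ "royalty", dimension 1 ≈ "gender" — illustrative only.
vectors = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
}

# Component-wise: king - man + woman
result = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]

print(result)                      # [1.0, -1.0]
print(result == vectors["queen"])  # True
```

In real models the result only lands *near* the queen vector, which is why the relation is written with ≈ rather than =.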

Cosine Similarity

The most common way to measure how similar two texts are is to measure the angle between their embedding vectors. Same direction (0°) = similarity of 1. Perpendicular (90°) = similarity of 0. Opposite directions (180°) = similarity of −1.
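The formula is the dot product of the two vectors divided by the product of their lengths. A minimal sketch using only the standard library:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b.

    1.0 = same direction (0°), 0.0 = perpendicular (90°),
    -1.0 = opposite directions (180°).
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 0], [1, 0]))  # 1.0  (same direction)
print(cosine_similarity([1, 0], [0, 1]))  # 0.0  (perpendicular)
```

Because only the angle matters, a long document and a short query can still score as highly similar if their vectors point the same way.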

Popular Embedding Models

  • OpenAI text-embedding-3-small/large — High quality, API-based
  • Cohere embed-v3 — Excellent multilingual support
  • BGE, E5, all-MiniLM-L6-v2 — Open-source options

Practical Tips

  1. Use the same embedding model for both queries and documents
  2. Normalize vectors before computing similarity
  3. Use batch processing for efficiency
  4. Cache embeddings — generate once, reuse many times
  5. For non-English text, use multilingual models
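Tips 2 and 4 can be combined into a small helper. This is a sketch under assumptions: `embed_fn` stands in for whichever embedding API you actually call, and the in-memory dictionary cache is illustrative, not a specific library's feature:

```python
import math

def normalize(v):
    """Scale a vector to unit length, so a plain dot product
    between two normalized vectors equals their cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

_cache = {}

def embed_cached(text, embed_fn):
    """Embed each distinct text once (tip 4), storing the
    normalized vector (tip 2); later calls hit the cache."""
    if text not in _cache:
        _cache[text] = normalize(embed_fn(text))
    return _cache[text]
```

In production the dictionary would typically be replaced by a persistent store keyed by a hash of the text, but the principle is the same: pay for each embedding only once.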

Summary

  • Embedding converts text to numerical vectors preserving meaning
  • Like giving addresses to words in a city of meaning
  • Math operations on vectors are semantically meaningful
  • Cosine Similarity measures how similar two texts are
  • Various models available: OpenAI, Cohere, and open-source
  • For non-English languages, multilingual models work better