Inside an LLM — Token, Embedding and Transformer

Episode 3 · 25 min

Every word you send to an LLM goes through a fascinating journey. In this episode, we open the hood and look inside a Large Language Model to understand three fundamental concepts: Tokens, Embeddings, and the Transformer architecture.

Tokens — The Atoms of LLMs

LLMs do not process text character by character or word by word. They break text into units called tokens. A token is roughly 3-4 English characters or about 0.75 words. The word “understanding” might be two tokens: “under” + “standing”.

Why does this matter? Because everything in LLMs is measured in tokens: input cost, output cost, context window size, and processing speed. Understanding tokenization helps you estimate costs and optimize prompts.
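The ~4-characters-per-token rule of thumb can be turned into a quick back-of-the-envelope estimator. This is a sketch, not a real tokenizer: the function names are mine, and the per-1K-token prices are placeholders, not actual API rates.

```python
def estimate_tokens(text: str) -> int:
    """Approximate token count via the ~4 characters/token heuristic."""
    return max(1, round(len(text) / 4))


def estimate_cost(prompt: str, expected_output_tokens: int,
                  input_price_per_1k: float = 0.001,
                  output_price_per_1k: float = 0.002) -> float:
    """Rough cost of one API call. Prices here are hypothetical examples."""
    input_tokens = estimate_tokens(prompt)
    return (input_tokens * input_price_per_1k
            + expected_output_tokens * output_price_per_1k) / 1000


prompt = "Summarize the following article in three bullet points."
print(estimate_tokens(prompt))  # roughly len(prompt) / 4
```

For precise counts you would use the provider's own tokenizer, but a heuristic like this is usually enough for budgeting prompts.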

The Challenge with Non-English Languages

Here is an important point: tokenizers are primarily trained on English text. This means non-English languages like Persian, Arabic, or Chinese typically require 2-4x more tokens for the same content. Writing in Persian is literally more expensive when using LLM APIs.

Embedding — When Text Becomes Numbers

Computers only understand numbers. So how does an LLM understand the meaning of words? Through embeddings — converting text into vectors (lists of numbers) that capture semantic meaning.

Think of it as giving every word an address in a “city of meaning.” Related words live in the same neighborhood: “cat” and “dog” are neighbors, while “cat” and “democracy” are on opposite sides of the city.

The fascinating property of embeddings is that math operations on vectors are meaningful:

king - man + woman ≈ queen
Paris - France + Italy ≈ Rome

This shows that embeddings truly capture meaning, not just characters.
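The "neighborhood" idea can be made concrete with cosine similarity, the standard way to measure how close two embedding vectors are. The 3-dimensional vectors below are invented for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import math

# Toy "embeddings" — made-up 3-D vectors placing cat/dog near each other
# and democracy far away, mirroring the city-of-meaning analogy.
embeddings = {
    "cat":       [0.90, 0.80, 0.10],
    "dog":       [0.85, 0.75, 0.15],
    "democracy": [0.05, 0.10, 0.95],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = (math.sqrt(sum(x * x for x in a))
            * math.sqrt(sum(y * y for y in b)))
    return dot / norm


print(cosine_similarity(embeddings["cat"], embeddings["dog"]))        # near 1
print(cosine_similarity(embeddings["cat"], embeddings["democracy"]))  # much lower
```

The same similarity measure underlies semantic search and retrieval systems: you embed a query, then rank documents by cosine similarity to it.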

Transformer Architecture — The Revolution

In 2017, Google published “Attention Is All You Need” and changed everything. The key innovation: Self-Attention.

Before Transformers, models processed text sequentially — word by word, left to right. Self-Attention allows the model to look at all words simultaneously and understand relationships between them regardless of distance.

When processing “The cat sat on the mat because it was tired,” self-attention helps the model understand that “it” refers to “cat” — even though several words separate them.
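A stripped-down, single-head version of self-attention can be written in a few lines. This sketch omits the learned query/key/value projections of a real Transformer and uses made-up 2-D token vectors; the point is only that every token's output is a weighted mix of all tokens at once, with no left-to-right order.

```python
import math


def softmax(xs):
    """Turn raw scores into a probability distribution."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Each output mixes ALL inputs, weighted by query-key similarity."""
    outputs = []
    for q in vectors:  # the token currently being processed (the "query")
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in vectors]
        weights = softmax(scores)  # how much attention to pay to each token
        mixed = [sum(w * v[d] for w, v in zip(weights, vectors))
                 for d in range(len(q))]
        outputs.append(mixed)
    return outputs


# Toy vectors standing in for "cat", "it", "mat" — "it" is close to "cat",
# so its attention weights concentrate there, regardless of distance.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
out = self_attention(tokens)
```

Because the loop over queries has no dependency between iterations, real implementations compute all of them in parallel as one matrix multiplication, which is what makes Transformers so efficient on GPUs.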

Context Window and Its Limitations

The context window is the maximum amount of text an LLM can consider at once. Current models range from 8K to over 1M tokens. But bigger is not always better — models tend to lose focus on information in the middle of very long contexts (the “Lost in the Middle” problem).
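A common practical consequence: chat applications must trim history to fit the window. Here is a minimal sketch of that, reusing the ~4 chars/token heuristic; in practice you would count with the model's real tokenizer and usually pin a system prompt so it is never dropped.

```python
def approx_tokens(text: str) -> int:
    """Crude token estimate via the ~4 characters/token rule of thumb."""
    return max(1, len(text) // 4)


def trim_to_window(messages, max_tokens):
    """Drop the oldest messages until the history fits the token budget."""
    kept = list(messages)
    while kept and sum(approx_tokens(m) for m in kept) > max_tokens:
        kept.pop(0)  # discard oldest first
    return kept
```

More sophisticated strategies summarize the dropped messages instead of discarding them, which also helps with the "Lost in the Middle" problem by keeping key facts near the ends of the context.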

How Text Generation Works

LLMs generate text one token at a time by predicting the most probable next token. The Temperature parameter controls randomness: low temperature (0.1) means predictable, focused output; high temperature (0.9) means creative, diverse output.
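Temperature works by rescaling the model's raw scores (logits) before they are turned into probabilities. The sketch below shows the effect on three made-up candidate-token scores: a low temperature sharpens the distribution toward the top token, a high temperature flattens it.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities, dividing by temperature first."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]                        # made-up scores for 3 tokens
cold = softmax_with_temperature(logits, 0.1)    # near-deterministic
hot = softmax_with_temperature(logits, 0.9)     # flatter, more diverse
print(cold)  # top token takes almost all the probability mass
print(hot)   # probability spread across candidates
```

Sampling then draws the next token from this distribution, which is why temperature 0.1 gives focused, repeatable output and 0.9 gives varied output.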

Summary

  • Tokens are the basic processing units of LLMs, roughly 0.75 English words each
  • Non-English languages use more tokens for the same content
  • Embeddings convert text to meaningful number vectors
  • Transformer architecture and Self-Attention allow parallel processing of all words
  • Context Window limits how much text a model can consider
  • Temperature controls the creativity vs predictability of outputs