Artificial Intelligence · 50 min read

AI Glossary — Every AI Term You Need to Know

The world of artificial intelligence is packed with specialized terminology that can quickly become overwhelming if you don’t know what it all means. Whether you’re a business executive or a developer, every AI conversation comes loaded with jargon. I wrote this AI glossary to serve as a comprehensive, always-accessible reference. I’ve compiled over 100 terms here, from the simplest concepts to the most advanced techniques. The terms are arranged in a learning-friendly order — if you start from the top and read through, you’ll build a complete mental map of the AI landscape.

Each term is explained in clear, accessible language. Where it helps, I’ve added analogies and examples, and I’ve tried to show how concepts connect to each other. If you’ve already read my articles on What is RAG or the Prompt Engineering Guide, this glossary is the perfect companion. And if you’re just getting started, this is the best place to begin.

Tip
Bookmark this page. Whenever you encounter an unfamiliar term somewhere, come back here. This AI glossary is regularly updated with new terms and improved explanations.

Table of Contents
  1. Foundational Concepts — Terms 1 to 15
  2. Language Models — Terms 16 to 27
  3. Notable Models — Terms 28 to 37
  4. Key Techniques & Concepts — Terms 38 to 52
  5. RAG — Terms 53 to 64
  6. Fine-tuning — Terms 65 to 76
  7. Agent — Terms 77 to 86
  8. Architecture & Infrastructure — Terms 87 to 96
  9. Business & Applications — Terms 97 to 108

1. Foundational Concepts

Terms 1 to 15

Before diving deeper, we need a shared vocabulary. This section covers the concepts that, if you don’t understand them, everything else will feel like a foreign language. Don’t worry — none of them are complicated. Read through once, and the entire path ahead will become much clearer.

01

Artificial Intelligence (AI)

The broadest umbrella term in this field. Artificial intelligence refers to any computer system that performs tasks which, if done by a human, we’d say required “intelligence.” This definition is extremely broad — everything from a simple email spam filter to ChatGPT qualifies as AI.

The key point is that AI is a vast spectrum. Netflix’s movie recommendation algorithm is AI. A license plate recognition system is AI. Claude and GPT are AI. When someone says “I’m an AI expert,” you should ask “What kind?” — much like hearing “I’m a doctor” and asking “What specialty?”

02

Machine Learning (ML)

A subset of AI. Instead of a programmer writing explicit rules, you tell the machine “figure out the rules from the data yourself.” Imagine building a spam filter. The old way: write a thousand rules (“if it contains ‘you won,’ it’s spam”). The ML way: give it thousands of emails and let it find the patterns on its own.

What makes ML revolutionary is that the algorithm discovers patterns that would never occur to a human programmer. For instance, it might learn that emails sent at 3 AM containing images have a higher spam probability — a pattern no one would have thought to code.

03

Deep Learning (DL)

A subset of ML, but with one major difference: it uses multi-layered neural networks. In traditional ML, you had to manually engineer “features” — for example, to recognize a cat in a photo, you’d specify “count the eyes, check the fur color.” Deep Learning said: “No. The model will discover the features itself. Just give it raw data.”

The word “Deep” refers to the multiple layers. Early networks had 2-3 layers. Modern models can have 100+ layers. The deeper they go, the more complex the patterns they can learn. ChatGPT, Claude, and all LLMs are built on Deep Learning.

04

Neural Network (NN)

The core structure of Deep Learning. Picture a machine with several “layers.” Each layer is made up of multiple “nodes” (also called neurons). Information enters at the first layer, undergoes a mathematical transformation at each layer, and produces an output at the final layer.

Each connection between nodes carries a number called a “weight.” When you train a model, you’re essentially adjusting these weights until the output is correct. A model with 70 billion parameters means it has 70 billion of these weights. Think of a neural network as a pipeline of functions — each one takes input, transforms it, and passes it to the next.

05

Supervised Learning

The most common type of ML. You give the model data along with the “correct answer.” For example, 10,000 images of cats and dogs with labels (“this is a cat,” “this is a dog”). The model learns what distinguishes a cat from a dog.

It has two main applications: Classification (categorizing) like “Is this email spam or not?” and Regression (predicting a number) like “What’s this house worth?” When we say “I’m training the model on my own data” during Fine-tuning, we’re essentially doing Supervised Learning.
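
To make this concrete, here is a minimal supervised-learning sketch in Python using scikit-learn; the tiny inline dataset and its labels are invented purely for illustration.

```python
# A minimal supervised-learning sketch with scikit-learn.
# The tiny inline dataset is invented purely for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

emails = [
    "You won a free prize, click now",
    "Meeting moved to 3 pm tomorrow",
    "Claim your reward, limited offer",
    "Can you review the quarterly report?",
]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam (the "correct answers")

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)         # turn text into numeric features
model = LogisticRegression().fit(X, labels)  # learn the pattern from labeled data

test = vectorizer.transform(["Free reward waiting for you"])
print(model.predict(test))  # -> [1], i.e. classified as spam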

06

Unsupervised Learning

Here you provide only data, without answers. The model finds patterns on its own. For instance, you feed it data on 100,000 customers and it says “these customers naturally divide into 5 groups.” Nobody told it what the groups should be — it discovered them.

Two key applications: Clustering (grouping similar items) like customer segmentation, and Dimensionality Reduction (compressing features) like when you have 100 features and want to reduce them to 5 without losing essential information.

07

Reinforcement Learning (RL)

The model is placed in an environment and takes actions. Good actions earn rewards; bad ones earn penalties. Over time, it learns to maximize its reward. This is the same approach that trained AlphaGo — the AI that defeated the world champion at Go.

Why should you care? Because RLHF (Reinforcement Learning from Human Feedback) is one of the key stages in building LLMs. When ChatGPT becomes “polite” after initial training and avoids dangerous responses, that refinement was done through RL with human feedback.

Analogy
Think of the three types of learning this way: Supervised is like a teacher giving you the answer key. Unsupervised is like a child discovering patterns on their own. Reinforcement is like a gamer getting better through trial and error.
| Feature | Supervised | Unsupervised | Reinforcement |
| --- | --- | --- | --- |
| Input | Data + answers | Data only | Environment + feedback |
| Output | Rules (model) | Patterns/groups | Optimal policy |
| Example | Spam detection | Customer segmentation | RLHF / games |
| Use in modern AI | Fine-tuning | Clustering | Model alignment |
Tip
An easy way to remember the difference between RL and Supervised Learning: Supervised is like studying with an answer key; RL is like learning to swim — you jump in the water and figure it out through experience.
08

Classification

One of the most common ML tasks. The model takes an input and assigns it to one of several “categories.” For example: “Is this email spam or not?” (two categories), “Is this image a cat, dog, or bird?” (three categories), or “Rate this product 1 to 5 stars” (five categories).

Classification is everywhere — from spam filtering to disease diagnosis. Even when a language model like Claude generates a response, at a lower level it’s performing classification: “Which word is most likely to come next after this one?”

09

Regression

Unlike Classification which assigns categories, Regression predicts a number. “How much is this house worth?” (a number), “What will tomorrow’s temperature be?” (a number), “How many months until this customer churns?” (a number).

Regression is as foundational and time-tested as Classification, and it’s still widely used. In many business projects, simple Regression models outperform complex Deep Learning models — especially when you have limited data.

10

Feature Engineering

In traditional ML (before Deep Learning), this was the hardest part. It means manually extracting the best features from your data. For example, to predict house prices: square footage, number of rooms, distance to the subway, year built — you had to select these yourself.

Deep Learning dealt a major blow to Feature Engineering because the model learns features directly from raw data. However, Feature Engineering is still crucial in many projects (especially with tabular data). Simpler models like XGBoost with good Feature Engineering often outperform deep models without it.

11

Overfitting

One of the most common ML problems. The model performs brilliantly on training data but terribly on new data. Why? Because it memorized the training data rather than learning the actual patterns. Like a student who memorized past exam papers but didn’t understand the subject.

Overfitting is especially likely when you have limited data and a very complex model. Solutions include: more data, a simpler model, Dropout, Regularization, and Early Stopping. During Fine-tuning, if you’re not careful, the model can easily overfit — it’s one of the 10 common AI project mistakes.

12

Learning Rate

One of the most important hyperparameters. Imagine you’re descending a mountain, searching for the lowest point (the valley). Learning Rate is your step size. Too large? You overshoot the valley. Too small? It takes forever to get there.

In Fine-tuning, the Learning Rate is typically set very low (e.g., 2e-5) because you don’t want to destroy the model’s existing knowledge. You’re only making fine adjustments. If you set it too high, the model suffers “Catastrophic Forgetting” and loses its prior knowledge.
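
As a rough illustration of what the step size does, here is a toy gradient-descent loop on a made-up one-variable function; it is not how LLM training actually runs, just the core idea.

```python
# Toy gradient descent on f(w) = (w - 3)^2, whose minimum is at w = 3.
# The learning rate controls how big each step toward the minimum is.
def gradient(w):
    return 2 * (w - 3)  # derivative of (w - 3)^2

w = 0.0
learning_rate = 0.1     # try 1.1 to see overshooting, 0.001 to see painfully slow progress
for step in range(50):
    w = w - learning_rate * gradient(w)  # step downhill, scaled by the learning rate

print(round(w, 4))  # close to 3.0
```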

Warning
Overfitting is one of the most common reasons ML projects fail. Your model has 99% accuracy on training data but 60% on real customers? That’s overfitting. Always test on data the model hasn’t seen (a Validation Set).
13

Backpropagation

The algorithm that forms the backbone of neural network training. When the model gives a wrong answer, Backpropagation traces the error backwards (through previous layers) and adjusts the weights. It’s like a teacher saying “you got this wrong” and you going back to figure out where your reasoning went off track.

This algorithm was popularized in 1986, but it only became practical at today’s scale around 2012, when powerful GPUs made training deep networks feasible. Now, every time a model is “trained,” it’s performing Backpropagation billions of times.

14

Dataset

The data used for training. It can be text, images, audio, or anything else. This is the single most important factor determining the quality of the final model. “Garbage in, garbage out” is AI’s oldest motto, and it’s still 100% true.

Dataset quality matters more than size. 1,000 clean, diverse examples are better than 1 million noisy, repetitive ones. In the Practical Fine-tuning series, I discuss how to build a good dataset in detail.

15

Parameter

Every weight inside a neural network is a parameter. When we say a model is “7B” (7 billion parameters), it means there are 7 billion numbers that were tuned during training. A 70B model has 70 billion. More parameters generally means a more capable model, but it also requires more VRAM.

A rule of thumb: each billion parameters requires roughly 2 GB of VRAM (at FP16 precision). So a 7B model needs about 14 GB, and a 70B model needs about 140 GB. With Quantization (explained later), these numbers drop significantly.
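
That rule of thumb is easy to turn into a quick back-of-the-envelope calculation; the numbers below are rough estimates for the weights alone, not exact requirements.

```python
# Rough VRAM estimate for loading model weights (ignores activations and KV cache).
def estimate_vram_gb(billions_of_params: float, bytes_per_param: float = 2.0) -> float:
    """bytes_per_param: 2.0 for FP16, 1.0 for INT8, 0.5 for INT4."""
    return billions_of_params * bytes_per_param  # 1B params * 2 bytes ~= 2 GB

print(estimate_vram_gb(7))        # ~14 GB at FP16
print(estimate_vram_gb(70))       # ~140 GB at FP16
print(estimate_vram_gb(70, 0.5))  # ~35 GB at INT4
```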

Checkpoint
You’ve now learned 15 foundational concepts. If you understand these, the rest of this AI glossary will be much easier. Now let’s move on to language models — the technology behind ChatGPT and Claude.

2. Language Models

Terms 16 to 27

Large Language Models (LLMs) are the stars of today’s AI world — the technology that powers ChatGPT and Claude. This section covers the terms you need to understand how LLMs work, from Token and Embedding to Context Window and Temperature. If you’re familiar with the AI Development Zero to Hero series, you’ve seen many of these. But here’s a quick, precise review.

16

Large Language Model (LLM)

A deep neural network specialized for natural language. LLMs are trained on trillions of words from the internet, use a specific architecture called Transformer, and their core task is predicting the next word. But thanks to their massive scale, they can do far more than simple prediction — they can reason, summarize, translate, and write code.

A fascinating discovery of the past decade: when a model is large enough and trained on enough data, capabilities emerge that nobody explicitly taught it (Emergent Abilities). For example, GPT-3 could suddenly solve math problems even though no one specifically trained it on mathematics.

Analogy
An LLM is like a walking library — it has read billions of pages of text and can now discuss virtually anything. But keep in mind: “reading” here means recognizing patterns, not “understanding” the way humans do.
17

Transformer

The revolutionary architecture behind all modern LLMs. Google introduced it in the landmark paper “Attention is All You Need” (2017). Before Transformers, models used RNNs and LSTMs, which were slow and struggled with long texts.

The Transformer’s core idea is the “Attention” mechanism — the model can compare every word with every other word and determine which ones are most relevant. For example, in the sentence “Ali took his dog to the park and played with him there,” the model needs to understand that “there” refers to “the park.” Attention makes that possible.

18

Token

The smallest unit of text that a model processes. Contrary to what you might expect, a token isn’t always a full word. It can be a whole word, part of a word, or even a single character. For example, “tokenization” might be split into 3 tokens: “token”, “iza”, “tion”. Non-English text typically consumes more tokens because models are primarily trained on English.

Why does Token matter? Because API pricing (from OpenAI, Anthropic, etc.) is based on token count. Context Window size is also measured in tokens. A rule of thumb: each token is roughly 0.75 English words.
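
If you want to see tokenization in action, here is a small sketch using the tiktoken library (assuming it is installed); exact token counts vary from model to model.

```python
# Counting tokens with OpenAI's tiktoken library (pip install tiktoken).
# Exact splits differ between models; this only shows the idea.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # encoding used by several GPT models
tokens = enc.encode("Tokenization is not the same as splitting on spaces.")
print(len(tokens))          # the number of tokens the API would bill for
print(enc.decode(tokens))   # decoding gives the original text back
```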

19

BPE (Byte-Pair Encoding)

The algorithm models use to convert text into tokens. The idea is simple: start with individual characters and merge the most frequent pairs. For instance, “t” and “h” appear together often, so “th” becomes one token. Then “the” becomes one token. And so on.

Why does BPE matter? Because tokenization quality directly affects model performance. Models with better tokenizers handle non-English languages more effectively. The Qwen vs. Llama comparison shows that models with better multilingual tokenizers process non-English text more efficiently.

20

Embedding

Converting text (or any type of data) into a list of numbers (a vector) that captures meaning. For example, the word “cat” becomes a 768-dimensional vector. Two words with similar meanings will have vectors that are close together. Embedding is the foundation of RAG, semantic search, and many other techniques.

Embedding models are separate from LLMs. They’re smaller, faster, and cheaper. Examples include text-embedding-3 from OpenAI or BGE-M3, which is open-source and supports multiple languages well. I’ve written more about them in the Vector Database article.
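
As a hedged sketch of how this looks in code, here is the sentence-transformers library loading BGE-M3 (mentioned above) and comparing sentence vectors; any embedding model would work the same way.

```python
# Turning sentences into vectors and comparing them (pip install sentence-transformers).
# BGE-M3 is used only because it's mentioned above; other embedding models work identically.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-m3")
vectors = model.encode([
    "The cat sat on the mat",
    "A kitten rested on the rug",
    "Quarterly revenue grew 12%",
])

print(vectors.shape)                         # (3, embedding_dimension)
print(util.cos_sim(vectors[0], vectors[1]))  # high score: similar meaning
print(util.cos_sim(vectors[0], vectors[2]))  # low score: unrelated topics
```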

21

Context Window

The maximum number of tokens a model can “see” at once. For example, the original GPT-3 had a 2,048-token window. Modern models may have windows of 200,000, 1 million, or even more tokens. The larger the window, the more conversation context the model can retain.

But bigger isn’t always better. It costs more, and there’s the “Lost in the Middle” phenomenon — models recall information at the beginning and end of the context better than what’s in the middle. Read the Context Window article for more details.

22

Prompt

What you give to the model. It can be a question, an instruction, or text. The art of writing good prompts is called Prompt Engineering — and it’s far more important than most people think. A well-crafted prompt can transform the output from “poor” to “excellent.”

I wrote a complete Prompt Engineering guide to show you how to ask better questions of models. Even if you only use ChatGPT and don’t write any code, Prompt Engineering will help you.

23

Completion / Response

The model’s output. When you send a Prompt, what the model returns is a Completion or Response. It’s called “Completion” because an LLM is fundamentally “completing” your text — predicting what should come after what you wrote.

You typically control the length of a Completion with the max_tokens parameter. For example, setting it to 1,000 means the model generates at most 1,000 tokens. Note that both the Prompt and the Completion consume space in the Context Window.

Common Question
“Is there a difference between Completion and Response?” Practically, no. Completion is the older term (from the GPT-3 Completion API era). Response is newer (from the Chat API era). Both mean the model’s output.
24

Autoregressive

How LLMs generate text. The model produces one token at a time, then adds that token to the input and predicts the next one. It’s like a writer who writes only one word at a time, reads what they’ve written so far, then decides on the next word.

This means LLM text generation speed is inherently limited — it can’t produce all words simultaneously. That’s why when you use ChatGPT, the answer appears word by word (streaming). It also means that any early mistake can cascade through the rest of the output.

25

Hallucination

When a model confidently states something that’s completely wrong. This is the biggest problem with LLMs. Ask it “Who wrote book X?” and if it doesn’t know, it will invent an author’s name and deliver it with full confidence — as if it were fact.

For serious projects, this is extremely dangerous. Imagine an AI recommending an incorrect exercise program. The main solutions: RAG (providing the model with real information) and “say I don’t know” prompting (training the model to admit uncertainty when it isn’t sure).

26

Temperature

A parameter that controls how “creative” the model is. Zero means it always picks the most probable answer — completely predictable. A high value (e.g., 1.5) means diverse, creative responses — though they might get strange.

For projects where accuracy matters (like RAG, information extraction), set Temperature to zero or near-zero. For creative writing, 0.7 to 0.9 works well. Temperature above 1 usually produces low-quality output — exciting but unreliable.

27

Logits, Softmax, and Top-k — The Word Selection Mechanism

When a model needs to choose the next word, it first assigns a raw score (Logit) to every word in its vocabulary. Then Softmax converts these raw scores into probabilities (summing to 1). For example: “hello” 0.4, “hi” 0.2, “hey” 0.1, and so on. Then it selects one based on these probabilities.

Top-k is a filter: it keeps only the k words with the highest probabilities and discards the rest. For instance, Top-k=50 means “choose only from the top 50 candidates.” This makes output more coherent. Top-p (or Nucleus Sampling) works similarly but filters based on cumulative probability.
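
Here is the whole selection mechanism in a few lines of numpy; the five-word vocabulary and the logit values are invented for illustration.

```python
# Softmax + temperature + top-k over a made-up 5-word vocabulary.
import numpy as np

vocab  = ["hello", "hi", "hey", "greetings", "yo"]
logits = np.array([2.0, 1.3, 0.6, 0.1, -0.5])   # raw scores from the model (invented)

def softmax(x, temperature=1.0):
    x = x / temperature                  # temperature rescales the scores
    e = np.exp(x - x.max())              # subtract max for numerical stability
    return e / e.sum()

probs = softmax(logits, temperature=0.8)

# Top-k: keep only the k most likely candidates, renormalize, then sample.
k = 3
top = np.argsort(probs)[-k:]
top_probs = probs[top] / probs[top].sum()
choice = np.random.choice(top, p=top_probs)
print(vocab[choice])
```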

Analogy
The LLM’s word generation process is like an election. Every word gets a vote (Logit), Softmax calculates vote percentages, Temperature determines how much “chance” plays a role, and Top-k eliminates weak candidates.

3. Notable Models

Terms 28 to 37

Now that you understand how LLMs work, let’s see who the major players are. New models are announced every month, but this list covers the most important ones you should know in 2026. We’ll divide them into Open-Source and Closed-Source categories.

28

Open-Source vs. Closed-Source

Closed-Source: You can’t download the model. You can only use it through an API (e.g., GPT-5, Claude). Your data goes to their servers. Open-Source: You can download the model, run it on your own server, and even modify it (e.g., Llama, Qwen, DeepSeek).

For serious projects — especially when data privacy matters or you want to fine-tune the model — open-source is often the better choice. Open-source models have gotten remarkably close to closed-source ones, which is one of the biggest shifts of 2024-2026.

| Feature | Open-Source | Closed-Source |
| --- | --- | --- |
| Access | Download + run locally | API only |
| Cost | Your own GPU (or cloud) | Per-token pricing |
| Privacy | Data stays with you | Data goes to their servers |
| Fine-tuning | Fully possible | Limited |
| Examples | Llama 4, Qwen 3, DeepSeek V4 | GPT-5, Claude Opus 4.7 |
Warning
Choosing between Open and Closed isn’t just a technical decision. If you handle sensitive data (medical, financial, legal), Open-Source offers a privacy advantage — your data stays on your servers. However, Closed-Source models have easier APIs and are usually still slightly more capable.
29

GPT — OpenAI’s Model Family

Generative Pre-trained Transformer. The family of models that changed the world. GPT-3 (2020) stunned the world, ChatGPT (November 2022) became the fastest-growing digital product in history (1 million users in 5 days), and GPT-5 (2025) is now the frontier of closed-source models.

OpenAI offers a range of models: GPT-5 (most powerful), GPT-5.5 (newest), and the o-series models for reasoning. Access is only through the API and ChatGPT — the models are not open-source.

30

Claude — Anthropic’s Model Family

Models built by Anthropic. Claude is known for its safety, accuracy, and large Context Window. Claude Opus 4.7 (the latest flagship model) is one of the most powerful models available. Claude also comes in Sonnet (faster and cheaper) and Haiku (smallest and fastest) variants.

Anthropic takes a distinctive approach to AI safety and trains Claude using a technique called Constitutional AI. Access is available through the API and claude.ai.

31

Gemini — Google’s Model Family

Google’s models. Gemini is natively multi-modal — meaning it was designed from the ground up to understand text, images, audio, and video simultaneously. Gemini 2.5 (the latest version) has a very large Context Window and competitive performance against GPT-5 and Claude.

Google has integrated Gemini across all its products — from Search to Android to Google Workspace.

32

Llama — Meta’s Open-Source Model

Meta’s (Facebook’s parent company) open-source model family. When Llama was released, it changed the game — it was the first time a high-quality large model was freely available. Llama 4 (the latest) is a serious competitor to closed-source models.

Llama is well-suited for Fine-tuning and has a large community. If you’re just getting started with open-source models, Llama is one of the best choices. Read the Qwen vs. Llama comparison to see which performs better for non-English languages.

33

Qwen — Alibaba’s Open-Source Model

Models built by Alibaba Cloud. Qwen 3 (the latest version) has become very powerful, and it performs especially well for non-English languages. Qwen’s tokenizer is better optimized for Asian and Middle Eastern languages.

Qwen is available in various sizes (0.5B to 72B+) and is one of the best options for fine-tuning on non-English tasks.

34

DeepSeek — Chinese Open-Source Model

Models built by the Chinese company DeepSeek. The V4 version (with 1.6 trillion parameters) is one of the most powerful open-source models in the world. DeepSeek uses the MoE (Mixture of Experts) architecture, which allows it to deliver high performance with fewer active resources.

DeepSeek is particularly strong in coding and mathematics. DeepSeek-R1 (its reasoning model) has also attracted significant attention.

35

Mistral — French Open-Source Model

French company Mistral AI builds models with an outstanding size-to-performance ratio. Mistral Large (their flagship) is one of their strongest models, though its weights ship under a more restrictive research license; smaller models like Mixtral and Mistral 7B are released as fully open weights.

Mistral’s main advantage: their small models deliver high quality, making them excellent for getting started and experimenting. If you have limited GPU resources, Mistral models are a solid choice.

36

BERT — Google’s Text Understanding Model

Bidirectional Encoder Representations from Transformers. A model Google released in 2018. Unlike GPT which “generates” text, BERT is designed for “understanding” text. BERT reads text in both directions (left-to-right and right-to-left) to comprehend meaning.

BERT isn’t suitable for text generation, but it’s excellent for Classification, information extraction, and search. Many modern Embedding models are built on the BERT architecture. Google Search used BERT for years.

37

Multi-modal

A model that goes beyond just text. It can understand images, audio, and video too. GPT-5 and Claude Opus 4 can process text, images, and video. This means you can send a photo of a chart and say “analyze this” or send a food photo and ask “how many calories?”

The market trend is heading toward Multi-modal. Future models will all be Multi-modal. There are also specialized models: Image Models (like Midjourney, DALL-E), Audio Models (like Whisper for Speech-to-Text), and Video Models (like Sora and Veo).

Tip
If you want to fine-tune an open-source model for non-English tasks, start with Qwen or Llama. I’ve written a full comparison in the Qwen vs. Llama article.

4. Key Techniques & Concepts

Terms 38 to 52

Now that you know the major players, it’s time to learn the tools and techniques. This section covers concepts that everyone — business leaders and developers alike — should understand. From Prompt Engineering to Quantization, from Training to Inference.

38

Training

The process of teaching a model from data. During this phase, the model’s weights (parameters) are adjusted. This is the most expensive part — it can consume millions of dollars worth of GPU time. Training a model from scratch requires thousands of GPUs. We don’t do this — we use pre-trained models.

When Meta releases the Llama model, it’s already been trained. That means millions of dollars have been spent. We simply use it (Inference) or Fine-tune it.

39

Pre-training

The first phase of Training. The model is trained on a massive volume of internet text. It learns just one task: “Given the preceding words, predict the next word.” It’s that simple. But performing this simple task across trillions of words transforms the model into an extraordinarily capable system.

After Pre-training, the model still isn’t ready for use. Additional stages (SFT and RLHF) are needed to make the model “helpful” and “safe.”

40

Inference

Using the model after it’s been trained. When you give the model input and get output, that’s Inference. Every time you talk to ChatGPT, you’re doing Inference. This part is cheaper but still significant — all the work a GPU server does in production is Inference.

The key difference: Training is like building a factory — expensive, done once. Inference is like manufacturing products — cheaper per unit but done continuously. As a user, you deal with Inference and Fine-tuning, not Training from scratch.

Warning
Many people confuse Training and Fine-tuning. Training from scratch: millions of dollars. Fine-tuning: a few dollars to a few hundred dollars. Inference: just the cost of running. If someone says “I trained the model myself,” they mean Fine-tuned — not Training from scratch.
Analogy
Inference is like driving a car. Training was building the car in a factory (done once, very expensive). Now you’re riding in it and using it (daily, just fuel costs). Fine-tuning is like customizing the car — some aftermarket modifications to suit your needs.
41

Prompt Engineering

The art of crafting effective prompts to get the best results from a model. It includes techniques like Few-shot (providing a few examples), Chain-of-Thought (asking for step-by-step reasoning), and Role-playing (assigning a persona to the model). A good prompt can dramatically improve output quality.

I wrote a complete Prompt Engineering guide. If you only learn one skill from the AI world, make it this one — it works even without any coding.

42

System Prompt

An instruction given to the model before the user’s message that defines its overall behavior. For example: “You are a customer support assistant. Only answer questions about our products. Be polite. If you don’t know, say you don’t know.”

The System Prompt is the most important tool for controlling model behavior in real-world projects. In Agent and RAG projects, the System Prompt defines the model’s role, constraints, and response format. Without a good System Prompt, your project’s behavior is unpredictable.
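
As an illustration, here is how a System Prompt is typically passed in a chat-style API call; the model name and the prompt wording are placeholders, not recommendations.

```python
# How a System Prompt is typically passed in a chat-style API call.
# Model name and prompt wording are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": (
            "You are a customer support assistant. "
            "Only answer questions about our products. "
            "If you don't know, say you don't know."
        )},
        {"role": "user", "content": "What is your return policy?"},
    ],
)
print(response.choices[0].message.content)
```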

43

Fine-tuning

Taking a pre-trained model and training it a bit further on your own specific data to make it better at your particular task. Much cheaper than Training from scratch. You’ll do a lot of this in the Practical Fine-tuning series.

Fine-tuning is great for changing tone, learning specific formats, and specializing a model. But for “giving new information” to a model, RAG is better. Combining both (Fine-tune + RAG) yields the best results.

44

Quantization

A technique for shrinking a model by reducing the precision of its parameters. For example, instead of each parameter taking 2 bytes (FP16), it can take half a byte (INT4). The model becomes 4x smaller while quality drops only slightly. This technique is critical for running large models on limited hardware.

With Quantization, you can run a 70B model that normally requires 140 GB of VRAM with just 35 GB (at INT4). Popular tools for Quantization include GGUF and AWQ.

| Format | Size per parameter | 7B model size | Quality |
| --- | --- | --- | --- |
| FP32 | 4 bytes | ~28 GB | Highest (reference) |
| FP16 / BF16 | 2 bytes | ~14 GB | Virtually identical to FP32 |
| INT8 | 1 byte | ~7 GB | Slight reduction |
| INT4 | 0.5 bytes | ~3.5 GB | Acceptable for most tasks |
Practical Tip
Take Quantization seriously. If you run a 7B model with INT4, it fits on an RTX 3060 (12GB). You get 90-95% of the original model’s quality. For getting started and experimenting, it’s the best choice.
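
As a hedged sketch, this is roughly how a 4-bit load looks with the transformers and bitsandbytes libraries; the model name is a placeholder and the exact options depend on your hardware and setup.

```python
# Loading an open-source model in 4-bit (pip install transformers bitsandbytes accelerate).
# The model name is a placeholder; most causal LMs on HuggingFace load the same way.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # INT4 weights: ~4x smaller than FP16
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,   # computation still happens in 16-bit
)

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
```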
45

Emergent Abilities

One of the most surprising discoveries in AI. When a model gets large enough, capabilities suddenly appear that nobody explicitly trained it for. For instance, GPT-2 (2019) couldn’t solve math. GPT-3 (2020), which was simply larger, could suddenly do it.

Nobody knows exactly why this happens. Even researchers at OpenAI and Anthropic describe it as “empirical” — meaning “we see it happening but don’t know why.” This means developing with LLMs is somewhat unpredictable: be prepared for things you expected to work to fail, and for capabilities you didn’t expect to appear.

46

Knowledge Cutoff

Every model was trained up to a specific date and knows nothing after that. For example, if your model was trained through March 2026, it’s unaware of any events since then. This is one of the primary reasons RAG exists — to provide the model with up-to-date information.

Knowledge Cutoff isn’t just about dates. The model also knows nothing about your private data — your products, prices, documentation. This is where RAG plays a vital role.

47

Zero-shot / Few-shot / Many-shot

Different approaches to using a model based on how many examples you include in the Prompt. Zero-shot: No examples — just the instruction. Few-shot: 2-5 examples — the model picks up the pattern. Many-shot: Dozens of examples — for when accuracy is critical.

Few-shot is one of the simplest and most effective Prompt Engineering techniques. Instead of saying “give me the answer in this format,” show a few examples — the model understands what you want much better.

48

Chain-of-Thought (CoT)

A technique where you ask the model to “think step by step” before giving an answer. Instead of saying “give me the answer,” you say “reason through this step by step, then give me the answer.” This significantly improves accuracy, especially on math and logic problems.

OpenAI’s o-series models and Claude with thinking mode have built-in Chain-of-Thought — they “think” before answering. But even without these specialized models, simply adding “Let’s think step by step” to your prompt improves results.

49

RLHF — Reinforcement Learning from Human Feedback

A stage that comes after Pre-training and SFT (Supervised Fine-Tuning). Humans rank the model’s responses (which is better?) and the model learns from this feedback what a “good answer” looks like.

RLHF is what made ChatGPT “polite” and “helpful.” Without it, the model might give inappropriate, dangerous, or irrelevant answers. DPO (Direct Preference Optimization) is a simpler, cheaper alternative that we cover in the Practical Fine-tuning series.

50

API — Application Programming Interface

The way you access a model through code. When we say “use the model’s API,” it means instead of chatting directly, you send requests via code and get responses back. OpenAI’s API, Anthropic’s API — they all work this way.

The API matters because it enables automation. Through ChatGPT, only one person can ask questions at a time. Through an API, you can send thousands of requests per second. Read the Python for AI article if you want to get started.
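
For illustration, here is a minimal call using Anthropic's Python SDK; the model name is a placeholder, so check the provider's docs for current names and pricing.

```python
# Calling a model through an API instead of a chat UI (pip install anthropic).
# The model name is a placeholder; check the provider's documentation for current names.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=500,
    messages=[{"role": "user", "content": "Summarize this support ticket in two sentences: ..."}],
)
print(message.content[0].text)
```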

51

Tokenizer

The algorithm that converts text into tokens. Each model has its own tokenizer. For instance, GPT-4’s tokenizer might split a non-English word into 3 tokens while Qwen’s tokenizer produces only 2 for the same word. This means Qwen processes that language more “efficiently.”

The tokenizer has a direct impact on both cost (since pricing is per-token) and quality. A model that encodes a language with fewer tokens usually understands that language better too.

52

Benchmark

Standardized tests that compare model performance. Examples include MMLU (general knowledge), HumanEval (coding), and GSM8K (math). When OpenAI says “GPT-5 outperforms Claude,” they’re referring to benchmark scores.

But read benchmarks with caution. A model that excels on benchmarks might perform poorly in real-world projects — especially when non-English languages are involved. The best test is your own test on your actual project data. Over-reliance on benchmarks is one of the common AI mistakes.

Where Are We Now?
You’ve read more than half of this AI glossary. You’ve learned the fundamentals, gotten to know the models and techniques. Now let’s move on to RAG — the technology at the heart of commercial AI projects.

5. RAG — Retrieval-Augmented Generation

Terms 53 to 64

RAG is arguably the most important technology in commercial AI. If you want to build a real AI product — not just a demo — you need RAG. I’ve already written a comprehensive RAG article, and we also have the RAG Zero to Production series. Here’s a summary of the key terms.

53

RAG — Retrieval-Augmented Generation

Instead of expecting the model to know everything from memory, you find the relevant information at query time and put it in front of the model. Three stages: Retrieval (finding relevant information), Augmentation (adding it to the Prompt), Generation (the model creates an answer using this information).

Simple analogy: think of a brilliant doctor who has no memory but you place the patient’s file in front of them every time. The doctor has medical knowledge (= the language model), the file contains the patient’s information (= retrieved data), and combining them produces an accurate diagnosis (= the final answer). Be sure to read the complete RAG article.
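
The three stages map directly onto code. In this sketch, search_vector_db and call_llm are hypothetical stand-ins for your vector database query and your model API call.

```python
# The three RAG stages in miniature. `search_vector_db` and `call_llm` are
# hypothetical stand-ins for a real vector database query and a real model API call.
def answer_with_rag(question: str) -> str:
    # 1) Retrieval: find the chunks most relevant to the question
    chunks = search_vector_db(question, top_k=5)

    # 2) Augmentation: put the retrieved text into the prompt
    context = "\n\n".join(chunks)
    prompt = (
        "Answer using ONLY the context below. If the answer isn't there, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # 3) Generation: the model writes the answer from the provided context
    return call_llm(prompt)
```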

54

Vector Database

A database optimized for storing and rapidly searching Embeddings. Unlike a traditional database (like MySQL) that performs exact keyword matching, a Vector DB can find the “nearest vectors” — enabling semantic search.

The most popular options: Qdrant (open-source, fast), Chroma (simple, good for getting started), Pinecone (managed), pgvector (if you’re already using PostgreSQL). Read the Vector Database article to choose the best option.

55

Chunking

The process of splitting large documents into smaller pieces for storage in a Vector DB. How you chunk directly affects RAG quality. If chunks are too small, there’s insufficient context. Too large? Too much noise.

Different methods: Fixed Size (every 500 characters), Recursive (based on natural text boundaries), Semantic (based on topic changes), Document-based (based on document structure). Rule of thumb: 200 to 1,000 tokens with about 50 tokens of overlap.
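
A minimal fixed-size chunker with overlap might look like this; it counts characters for simplicity, while real pipelines usually count tokens and respect sentence or paragraph boundaries.

```python
# Minimal fixed-size chunking with overlap. Counts characters for simplicity;
# real pipelines usually count tokens and respect natural text boundaries.
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap   # step forward, keeping some shared context
    return chunks

doc = "example sentence. " * 500  # placeholder document
print(len(chunk_text(doc)))       # number of chunks produced
```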

56

Cosine Similarity

The most common method for measuring similarity between two vectors. It returns a number between -1 and 1; for typical text embeddings, scores usually land between 0 and 1. A value near 1 means very similar, near 0 means unrelated. When searching for the “nearest” vector in RAG, you’re typically calculating Cosine Similarity.

Why Cosine and not regular (Euclidean) distance? Because Cosine looks at the “direction” of the vector, not its “magnitude.” Two sentences with similar meanings but different lengths will have high Cosine Similarity but potentially large Euclidean distance.
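
The formula itself is short; here it is in numpy with invented example vectors.

```python
# Cosine similarity = dot product of the vectors divided by the product of their lengths.
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([0.2, 0.8, 0.1])    # invented embedding vectors
b = np.array([0.25, 0.75, 0.05])
c = np.array([0.9, 0.0, 0.4])

print(cosine_similarity(a, b))   # close to 1: same direction, similar meaning
print(cosine_similarity(a, c))   # much lower: different direction
```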

Think About It
Why is Semantic Search better than keyword search? Suppose you wrote “Tuesday morning workout: 30-minute run” and the user asks “What’s my exercise plan for tomorrow?” There are zero overlapping keywords. But Semantic Search understands both are about exercise and scheduling.
59

Re-ranking

After the initial search retrieves 20-30 results, a separate model re-ranks them. Re-rankers are typically more accurate but slower — which is why they only run on the initial results, not the entire database.

Cohere Reranker and BGE-Reranker are among the most popular. Adding Re-ranking to your RAG pipeline has a significant impact on result quality.

60

Metadata

Additional information stored alongside each chunk: source, date, author, category. Metadata is crucial because it lets you filter searches. For example: “only search documents from 2026” or “only search the technical section.”

Without good Metadata, RAG operates blindly. Metadata also helps the model show the source of its answer (Citation) — which is critical for user trust.

Analogy
Metadata is like labels on filing cabinet folders. Without labels, you have to search through every folder each time. With labels, you go straight to the right shelf.
61

Indexing

The process of preparing data for RAG. It involves: collecting documents, Chunking, converting to Embeddings, and storing in a Vector Database. This is done once (and updated whenever new data is added).

Indexing quality directly affects Retrieval quality. If Indexing is done poorly, no matter how good the model is, the answers will be weak. Garbage in, garbage out — it applies here too.

62

Citation

Showing the user the source of an answer. For example: “According to document X, the return policy is 7 days.” Citation builds user trust and enables answer verification. In RAG, Citations are extracted from the Metadata during the Retrieval stage.

Without Citation, users don’t know where the answer came from and can’t trust it. Especially in sensitive domains (medical, legal, financial), Citation is essential.

63

Grounding

The process of anchoring a model to reality. Without Grounding, the model may hallucinate. RAG is the primary method of Grounding — by providing real information to the model, you prevent it from making things up.

Grounding isn’t limited to RAG. It also includes connecting to external APIs (like real-time pricing databases), tools (like a calculator), and anything that keeps the model “connected to reality.”

64

Query Expansion

A technique for improving Retrieval. You rewrite or expand the user’s query before searching. For example, you transform “return policy” into “return policy OR refund OR exchange OR send back” to find more results.

You can even use the LLM itself for Query Expansion. Ask it to “rewrite this question 3 different ways” and then search with all of them. Multi-step RAG is similar — first generate a preliminary answer, then search again based on it.

Analogy
Query Expansion is like searching Google not just for one keyword but for several synonyms. Don’t search for “return” alone — search for “return + refund + exchange + send back.” You’ll get more and better results.

Practical Tip
If you take one thing away from the RAG section, let it be this: Retrieval quality is the single most important factor in RAG success. If you feed the model wrong information, the answer will be wrong too. Spend 80% of your time improving Retrieval, not the LLM.

6. Fine-tuning

Terms 65 to 76

Fine-tuning means taking a ready-made model and specializing it for your task. The terms in this section are more technical, but if you plan to customize a model for your project, you need to know them. The Practical Fine-tuning series covers all of these in detail.
65

Full Fine-tuning

You train all of the model’s parameters. The most precise method, but also the heaviest. For a 7B model, you need at least 80 GB of VRAM (because beyond the weights, gradients and optimizer states also need to stay in memory).

Full Fine-tuning yields the best results, but most people don’t use it because LoRA and QLoRA achieve similar quality with far fewer resources.

66

LoRA — Low-Rank Adaptation

Instead of training all parameters, LoRA adds small matrices (low-rank) and trains only those. It’s like annotating a book’s margins instead of rewriting the entire book. The training volume is much smaller (typically 1-2% of parameters), yet the results are very close to Full Fine-tuning.

LoRA was revolutionary for Fine-tuning. Before it, only large companies could afford to fine-tune models. Now you can do it with a consumer GPU.
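
As a hedged sketch, this is roughly how LoRA adapters are attached with HuggingFace's peft library; the base model and the hyperparameters are illustrative defaults, not a recipe.

```python
# Attaching LoRA adapters to a base model with HuggingFace's peft library.
# The base model name and hyperparameters are illustrative, not a recipe.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")  # placeholder

lora_config = LoraConfig(
    r=16,                                 # rank of the small added matrices
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # which layers receive adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically ~1-2% of the full parameter count
```

In practice this is often combined with the 4-bit loading shown in the Quantization entry, which is exactly the QLoRA setup described next.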

67

QLoRA — LoRA + Quantization

Combining LoRA with Quantization. You load the model at 4-bit precision (using much less memory) and then apply LoRA on top. The result: you can Fine-tune a 7B model with just 6 GB of VRAM (a consumer-grade GPU).

QLoRA effectively democratized Fine-tuning. The Practical Fine-tuning series primarily uses QLoRA.

68

SFT — Supervised Fine-Tuning

A training stage that comes after Pre-training and before RLHF. The model is trained on examples of “question + good answer.” For instance: “Question: Where is Iran? Answer: Iran is a country in the Middle East…”

SFT is what transforms the model from a “text completer” into a “helpful assistant.” Without SFT, the model just continues text — it might extend your question rather than answer it.

69

DPO — Direct Preference Optimization

A simpler alternative to RLHF. Instead of training a separate reward model (which RLHF requires), DPO learns directly from pairs of “good answer + bad answer.” It’s simpler to implement and produces comparable results.

In Fine-tuning projects, DPO is typically used after SFT. You first teach the model to answer with SFT, then teach it what a “good answer” looks like with DPO.

Tip
DPO is simpler than RLHF and better suited for small teams. You only need pairs of “good answer + bad answer.” No need to train a separate Reward Model. In most projects, DPO is a solid replacement for RLHF.
70

Adapter

Small layers added to the base model without changing the model itself. LoRA is a type of Adapter. The big advantage: you have one base model and can mount multiple different Adapters on it — for example, one for a specific language, one for coding, one for customer support.

Adapters are like glasses. The base model is your eyes; the Adapter is the lens — swap it and you get a different perspective. Your eyes stay the same.

71

Epoch

One complete pass through the entire Dataset by the model = one Epoch. Fine-tuning is typically done in 1-3 Epochs. Too many? The model overfits. Too few? It doesn’t learn enough.

The optimal number of Epochs depends on Dataset size and task complexity. The best approach: monitor Training Loss — when it stops decreasing or Validation Loss starts rising, that’s enough.

72

Gradient

The direction and magnitude of change each parameter needs in order to reduce the model’s error. Backpropagation computes gradients, and the Optimizer uses them to update parameters. If Learning Rate is the step size, Gradient is the step direction.

Common problems: Gradient Vanishing (gradient too small — model doesn’t learn) and Gradient Exploding (gradient too large — model becomes unstable). Techniques like Gradient Clipping and Normalization address these issues.

73

Loss Function

A number that indicates “how wrong” the model is. The goal of Training: minimize Loss. When Loss decreases, the model is improving. If Loss isn’t going down, something’s wrong — the Dataset, Learning Rate, or architecture.

During Fine-tuning, you monitor Loss on both the Training Set and the Validation Set. If Training Loss decreases but Validation Loss increases, the model is overfitting.

74

Batch Size

The number of examples the model processes simultaneously before updating weights. Larger Batch Size = more stable training but more memory. Smaller Batch Size = less memory but noisier training.

When GPU memory is limited, you use Gradient Accumulation — accumulate several small batches and then update. The effect is like one large batch but with lower memory consumption.
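
In HuggingFace's Trainer, this is usually expressed as in the sketch below; the numbers are illustrative, not recommendations.

```python
# Gradient accumulation in HuggingFace TrainingArguments (numbers are illustrative).
# Effective batch size = per_device_train_batch_size * gradient_accumulation_steps.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=4,    # what fits in VRAM at once
    gradient_accumulation_steps=8,    # accumulate 8 small batches before each update
    learning_rate=2e-5,
    num_train_epochs=2,
)
# Behaves like a batch size of 32, with the memory footprint of a batch of 4.
```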

75

Unsloth — Fast Fine-tuning Tool

A library that makes Fine-tuning up to 2x faster and uses 60% less memory. It applies specific optimizations to the Transformer architecture. Especially great for QLoRA.

Unsloth is ideal for people with limited GPU resources (like the free T4 on Google Colab). The Practical Fine-tuning series uses Unsloth.

76

Catastrophic Forgetting

When you overdo Fine-tuning, the model forgets its prior knowledge. For example, you fine-tune a model for a new language and it forgets English! Solutions: low Learning Rate, few Epochs, and using LoRA (which doesn’t modify the original parameters).

Catastrophic Forgetting is one of the reasons LoRA became popular. Because the original model weights remain untouched — only the added Adapters change.

Warning
Fine-tuning isn’t magic. Before fine-tuning, make sure you’ve tried Prompt Engineering and RAG. In many cases, those two are sufficient and Fine-tuning isn’t needed. Fine-tuning is necessary only when you want to change the model’s tone, style, or output format.

7. Agent

Terms 77 to 86

Agents are the hottest topic in AI for 2025-2026. An Agent doesn’t just answer — it makes decisions and takes actions. Your project is actually an Agent, not a simple chatbot. The Building AI Agents series covers all the details.

77

Agent

An LLM that can make decisions and take actions, not just respond. For example, an Agent can decide “I need to query the database now,” “I should perform this action,” or “I need to send a message to the user.” The difference from a chatbot: a chatbot only responds; an Agent decides and executes.

Agents can use tools (Tools), have memory (Memory), and perform multi-step tasks. Check out the Building AI Agents series for hands-on learning.

78

Agent Loop

The core operating pattern of an Agent. A repeating cycle: “Think (Reason) → Decide (Act) → See the result (Observe) → Think again.” The Agent repeats this loop until the task is complete or the final answer is ready.

For example, when an Agent receives a question: first it thinks “I need to search the database,” then calls the search tool, reviews the results, decides whether that’s sufficient or not, and if not, acts again.
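
In code, the loop is surprisingly small. In this sketch, llm_decide_next_step and the tools dictionary are hypothetical stand-ins for your own model call and tool implementations.

```python
# The agent loop in miniature: think -> act -> observe -> repeat.
# `llm_decide_next_step` and `tools` are hypothetical stand-ins for your own code.
def run_agent(task: str, tools: dict, max_steps: int = 5) -> str:
    history = [f"Task: {task}"]
    for _ in range(max_steps):                    # guardrail: never loop forever
        decision = llm_decide_next_step(history)  # Reason: model picks a tool or finishes
        if decision["action"] == "finish":
            return decision["answer"]
        result = tools[decision["action"]](decision["input"])  # Act: call the chosen tool
        history.append(f"Observation: {result}")               # Observe: feed the result back
    return "Stopped: step limit reached."
```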

79

Tool Use / Function Calling

An LLM’s ability to invoke external tools. For example, the model decides “I need to check the weather” and calls the get_weather() function. Then it receives the result and formulates its answer. Modern models (GPT-5, Claude, Qwen) all have this capability.

Tool Use is what separates an Agent from a chatbot. Without Tool Use, a model can only talk. With Tool Use, it can take action — search, calculate, call APIs, send emails.
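
As an illustration, here is what a tool definition looks like in the JSON-schema style used by OpenAI-compatible chat APIs; get_weather is the classic textbook example, not a real service.

```python
# A tool definition in the JSON-schema style used by OpenAI-compatible chat APIs.
# get_weather is the classic illustrative example, not a real service.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. Berlin"},
            },
            "required": ["city"],
        },
    },
}]
# Passed as `tools=tools` in the chat completion call; when the model decides to use it,
# it replies with a tool call (function name + JSON arguments) instead of plain text.
```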

80

MCP — Model Context Protocol

A protocol introduced by Anthropic to standardize how LLMs connect to tools and data sources. Before MCP, every company had its own approach. MCP is a common standard: what USB is for hardware, MCP aims to be for connecting AI models to tools.

MCP is still new but gaining adoption rapidly. If you’re building Agents, familiarity with MCP is valuable.

81

Memory — Agent Memory

Without memory, Agents start from scratch every time (like a goldfish). Memory comes in two types: Short-term Memory (the history of the current conversation, limited by the Context Window) and Long-term Memory (information that persists across different sessions, typically stored in a database).

Implementing Long-term Memory is one of the main challenges in building Agents. You need to know what to store, when to forget, and how to retrieve it.

82

Planning

An Agent’s ability to break a large task into smaller steps. For example, decomposing “write an analytical report” into: 1) Collect data, 2) Analyze, 3) Create charts, 4) Write the report. More powerful LLMs have better Planning capabilities.

Planning is one of the hardest parts of building Agents. The model might plan poorly — skip steps, order them incorrectly, or get stuck in loops. That’s why good Agents should be “self-critical” and evaluate their own plans.

83

Multi-Agent

A system where multiple Agents collaborate. For example, a “Writer” Agent drafts content, an “Editor” Agent reviews it, and a “Critic” Agent provides feedback. Each has its own specialty, and together they produce better results.

Multi-Agent is still experimental and comes with its own complexities (coordination, cost, debugging). But for complex tasks, it delivers significantly better results than a single Agent.

84

Human-in-the-Loop

Designing the system so a human approves actions at critical points: the Agent must get approval before sending an email or making a purchase. This “semi-automated” model is the optimal approach for most real-world projects — especially when decision risk is high.

Human-in-the-Loop isn’t just about Agents. In Fine-tuning, when humans review and correct answers, that’s Human-in-the-Loop. In RAG, when users give “answer was helpful/not helpful” feedback, that’s Human-in-the-Loop too.

85

Guardrails

Mechanisms for constraining Agent/LLM behavior. For example: “Never give financial advice,” “Call the API at most 3 times,” “If you’re not sure, ask.” Guardrails are implemented both in the System Prompt (soft) and in code (hard).

Without Guardrails, an Agent might do unexpected things — enter infinite loops, generate excessive costs, or give inappropriate answers. Every Agent should have at minimum a step limit and a timeout.

86

Orchestration

Managing and coordinating the workflow between LLM, tools, data, and APIs. Frameworks like LangChain, LlamaIndex, and CrewAI are orchestration tools. They help you build AI pipelines without writing everything from scratch.

Orchestration is important, but don’t become too dependent on any framework. Understanding the concepts matters more than the framework — frameworks change, concepts remain.

Analogy
An Agent is like a smart employee: it has tools (Tool Use), memory (Memory), plans ahead (Planning), and asks the manager for approval when it’s unsure (Human-in-the-Loop). Guardrails are like company policies — boundaries it shouldn’t cross.

8. Architecture & Infrastructure

Terms 87 to 96

So far you’ve learned about concepts, models, and techniques. But all of these run on hardware and infrastructure. This section covers the terms you need to know when “running a model” comes up. Don’t worry — you don’t need to become an infrastructure engineer. Just understand what the terms mean.

87

GPU — Graphics Processing Unit

Originally built for gaming and graphics, but it turned out to be excellent for parallel computation (like training neural networks). NVIDIA is the undisputed leader in AI GPUs. The A100, H100, and H200 series cards are industry standards.

Why GPU and not CPU? Because a GPU has thousands of small cores that work simultaneously. Neural network training requires billions of matrix multiplications — a GPU performs these thousands of times faster than a CPU.

88

VRAM — Video RAM

The GPU’s dedicated memory. When we say “this model needs 24 GB of VRAM,” it means you need a graphics card with at least 24 GB of memory. VRAM is usually the bottleneck — it’s not the GPU’s speed but its memory that’s the limiting factor.

Consumer cards (like the RTX 4090) max out at 24 GB of VRAM. Server cards (like the A100) go up to 80 GB. That’s why Quantization is so important — it shrinks the model to fit in available VRAM.

89

Latency and Throughput

Latency: The time it takes for the first token of the response to appear (Time to First Token). Users shouldn’t have to wait too long. Throughput: The number of tokens generated per second. Both matter, but depending on the use case, one takes priority.

For chatbots, Latency matters more (users shouldn’t wait 5 seconds for a response). For batch processing (like analyzing thousands of emails), Throughput matters more.

90

Model Serving

The process of running a model and exposing it as a service (typically an API). Tools like vLLM (most popular), TGI (HuggingFace), and Ollama (simplest for local use) are built for this purpose.

Ollama is great for local experimentation — with a single command, it downloads and runs a model. For production, vLLM is better because it includes many optimizations (like Continuous Batching and PagedAttention).

91

Edge AI

Running an AI model on the user’s device (mobile, IoT, laptop) instead of in the Cloud. Advantages: high speed (no network latency), privacy (data never leaves the device), and offline capability. Limitation: limited computational power.

Apple Intelligence on iPhone is an example of Edge AI. Small models (3B-7B) with Quantization can run on phones. The market trend is moving toward a hybrid Edge + Cloud approach.

Did You Know?
Edge AI isn’t just about mobile phones. Self-driving cars, security cameras, and even smart refrigerators all use Edge AI. Anywhere a model runs on the device itself (not in the Cloud) is Edge AI.
92

MoE — Mixture of Experts

An architecture that divides the model into several “Experts.” For each input, only one or two experts are activated (not all). The result: the model can be very large (e.g., 1.6T parameters like DeepSeek V4) but only a small portion is active at any time — so its speed is comparable to much smaller models.

MoE is used in many frontier models. GPT-4 is also widely believed to use MoE (OpenAI has never confirmed it, but strong evidence suggests so).

93

Distillation

The process of creating a small model from a large model. The large model (Teacher) generates answers, and the small model (Student) learns to respond like the teacher. The result: a smaller, faster model whose quality is close to the large one.

Many popular small models (like Phi and Gemma) were built through Distillation. If you want a fast, affordable model, Distillation is an option.

94

GGUF — Model File Format

A file format for storing quantized models. llama.cpp and Ollama use GGUF. When you see “GGUF” while browsing HuggingFace, it means the model is ready for local execution.

GGUF replaced the older GGML format. Its advantage: single file, simplest way to run a model on CPU or limited GPU.

95

Scaling Laws

An important discovery showing that model performance improves predictably as you increase three things: model size (more parameters), data volume (more training data), and compute (more GPU hours). These laws were discovered by OpenAI and DeepMind.

Scaling Laws are why companies keep building larger and larger models — they know bigger = better (up to a point). The “up to a point” part is important — we may hit a ceiling in the future.

96

HuggingFace — The Model Hub

The largest platform for sharing models, datasets, and AI tools. Like GitHub is for code, HuggingFace is for models and data. You can find virtually any open-source model here.

HuggingFace also builds the transformers library — the most important Python library for working with AI models. If you work with open-source models, HuggingFace is your second home.

Tip
You don’t need to know all of these hands-on. If you’re a business leader, it’s enough to know what GPU and VRAM are, what Quantization means, and what running a model actually costs. If you’re a developer, you’ll learn all of these practically in the AI Development Zero to Hero series.

9. Business & Applications

Terms 97 to 108

The final section of this AI glossary, but perhaps the most important for many of you. Terms you need to know when bringing AI into real business. If you’re a manager, read this section twice. I’ve also written the AI for Managers series for you.

97

AI Readiness

Assessing how prepared your organization is to implement AI. This includes: data quality, technical infrastructure, team skills, organizational culture, and budget. Many AI projects fail not because of technology, but because the organization wasn’t ready.

Before starting any AI project, conduct an AI Readiness assessment. If your data lives in scattered spreadsheets and your team isn’t familiar with APIs, fix the infrastructure first. The article “Why Not Every Business Needs AI” explores this in depth.

98

POC — Proof of Concept

A small experimental project to prove that an idea is feasible. Before spending 6 months and significant resources, build a POC. 2-4 weeks, limited scope, clear goal: “Can AI solve this problem?”

Many AI projects should start as POCs. For example, before building a complete AI support system, build a POC that only answers 10 common questions. If the results are good, scale it up.

99

Use Case

A specific scenario that AI is supposed to solve. “Customer support” is not a Use Case — it’s too vague. “Automatically answering 20 frequently asked questions about product returns” is a good Use Case — specific, bounded, and measurable.

Defining a good Use Case is the most important first step of any AI project. Vague Use Case = failed project. Specific Use Case = high chance of success.

100

Vendor Lock-in

When your entire system becomes dependent on one specific provider and you can’t easily switch. For example, if you build everything on OpenAI’s API and one day prices increase 10x or the service goes down, you’re stuck.

The solution: design your architecture so swapping models is easy. Use an abstraction layer. Run part of your system with open-source models. Don’t put all your eggs in one basket.

Warning
Vendor Lock-in is one of the biggest risks in AI projects. OpenAI changes prices weekly. Anthropic might deprecate APIs. Always have a Plan B. The 10 Common AI Mistakes article covers this as well.
101

TCO — Total Cost of Ownership

The real, complete cost of an AI project. It’s not just the API cost. It includes: development, maintenance, infrastructure, monitoring, data updates, team training, and support. Many managers only see the API cost and get shocked later.

An example: monthly API cost might be $500, but the developer who maintains it costs $3,000/month. The actual TCO is 7x the API cost. The AI for Managers series covers this in detail.

Analogy
TCO is like the real cost of owning a car. The purchase price might be $30,000, but insurance + fuel + maintenance + parking adds up to $500/month. After 5 years, the real cost is $60,000 not $30,000. AI is the same — API costs are only part of the TCO.
102

ROI — Return on Investment

How much profit has your AI investment generated? Calculating ROI for AI is challenging because some benefits are qualitative (customer satisfaction, speed) and not directly measurable.

Recommendation: before starting, define success metrics (KPIs). For example: “reduce response time from 24 hours to 2 minutes” or “handle 30% of calls without a human operator.” Measure these afterward.

103

MVP — Minimum Viable Product

The simplest version of your AI product that actually works and can be shown to users. After the POC (proof of concept), the MVP is the next step — a real product but with minimal features.

MVP is especially important for AI because you get to see how the system behaves with real users. It might work perfectly in internal testing, but real users will ask questions you never anticipated.

104

Deployment

Getting the model from the development environment to the real world (Production). This includes: choosing infrastructure, optimizing speed, monitoring, error handling, and updates. Many AI projects fail at Deployment — not during development.

Deployment isn’t just “uploading code.” Latency must be acceptable, costs must be reasonable, and the system needs to run 24/7 without interruption. Model monitoring is also important — model performance can degrade over time (Model Drift).

Practical Tip
Plan for Deployment from the start. Many people build their model in a Jupyter Notebook and then wonder “How do I get this to Production?” From day one, think about: API, monitoring, automated updates, and cost.
105

Model Drift

When a model’s performance degrades over time. Why? Because the world changes. New products launch, prices change, customer behavior shifts. A model that performed brilliantly 6 months ago might underperform today.

The solution: continuous monitoring + periodic data updates. Especially in RAG, data must be regularly refreshed.

106

Responsible AI

A set of principles for ethical AI development: transparency (why did it make this decision?), fairness (no bias), privacy (user data is protected), and accountability (someone is responsible for model output). Especially important when AI makes decisions about people.

AI regulations in Europe (AI Act) and other regions are becoming stricter. The sooner you take Responsible AI seriously, the better.

107

Bias

When a model discriminates against a particular group. Bias typically comes from training data — if the data is biased, the model will be too. For example, a model trained mostly on English text might perform worse on other languages — that’s a form of linguistic bias.

Bias isn’t limited to race and gender. It can be geographic, linguistic, economic, or cultural. In projects targeting specific regions or languages, model bias toward English language and Western culture is a serious challenge.

108

Tokens per Dollar — Token Economics

A metric for comparing model costs. For example, GPT-5 might charge $10 per million input tokens while Claude Sonnet charges $3. But price alone isn’t everything — output quality, speed, and Context Window size also matter.

Prices are constantly dropping. In 2023, 1 million tokens cost about $60; now it might be $3. This trend continues and AI gets cheaper every month — great news for businesses.

What’s Your Next Step?
If you’re a business leader, the AI for Managers series is the best starting point. If you’re a developer, follow the AI Development Zero to Hero series. And if you want to learn RAG or Fine-tuning hands-on, check out the RAG Zero to Production and Practical Fine-tuning series.

Summary

This AI glossary, with over 100 terms, is a comprehensive reference for entering the world of artificial intelligence. We started with the most foundational concepts like Machine Learning and Neural Networks, moved on to LLMs and notable models, explored techniques, learned about RAG, Fine-tuning, and Agents, and finished with business concepts.

One thing to remember: knowing the terminology is just the beginning. What matters is getting your hands dirty and working practically. Build a simple RAG system. Fine-tune a model. Create a basic Agent. Each of these will teach you ten times more than reading alone.

This Page Is a Living Document
The AI world changes every week. This AI glossary is regularly updated — new terms are added and explanations are improved. If you notice a term that should be here but isn’t, let me know.

If you found this helpful, check out the educational series as well: