Three Stages of Building a Model — Pre-training, SFT, RLHF

Episode 2 · 18 min

A Language Model’s Journey from Zero to Usable

When you talk to a model like ChatGPT or LLaMA, you’re talking to the final product of a three-stage process. Each stage does a specific job, and without any one of them, the model wouldn’t be what you know today.

Think of it like training a new hire at an organization: first they go to university (Pre-training), then they do an internship (SFT), and then they receive customer feedback to improve their work (RLHF).

Stage 1: Pre-training — Read Everything

This is where the model starts from zero. It reads trillions of tokens of text from the internet, books, papers, and code, and learns to predict the “next word.” That’s it — just predicting the next word.

# Pre-training concept — simplified
# The model learns: given previous words, what's the next word?

text = "Artificial intelligence has made remarkable ___ in recent years"
# The model should predict: "progress" or "advances"

# Loss function: Cross-entropy
# Goal: minimize next-word prediction error
import torch.nn.functional as F

for batch in dataloader:
    input_ids = batch["input_ids"]      # Token IDs of the input text
    labels = batch["labels"]            # The same tokens shifted by one — the "next word" targets

    logits = model(input_ids)           # Shape: [batch, seq_len, vocab_size]
    loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)),   # Flatten to [batch*seq_len, vocab_size]
        labels.view(-1),                    # Flatten to [batch*seq_len]
    )

    optimizer.zero_grad()               # Reset gradients before each step
    loss.backward()
    optimizer.step()

Pre-training Specs

  • Data volume: Trillions of tokens (e.g., LLaMA 3 was trained on 15 trillion tokens)
  • Cost: Millions of dollars (A100/H100 GPUs running for weeks)
  • Result: A model that understands language but still doesn’t know how to respond properly
  • Will you do this? No! Unless you have a multi-million dollar budget

Important note: After pre-training, the model is still like an educated but inexperienced person. If you ask it a question, it might continue the text instead of answering. For example, ask “What is the capital of France?” and it might write “What is the capital of France and what features does it have? In this article…”

Stage 2: SFT — Teach It How to Respond

Supervised Fine-Tuning (SFT) is the stage where the model learns to answer questions. You show it thousands of question-answer pairs and say “when someone asks this, respond like that.”

# SFT data format
sft_data = [
    {
        "instruction": "What is the capital of France?",
        "output": "The capital of France is Paris."
    },
    {
        "instruction": "Write a Python function that finds prime numbers.",
        "output": """def is_prime(n):
    if n < 2:
        return False
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0:
            return False
    return True"""
    },
    {
        "instruction": "What's the difference between a list and a tuple in Python?",
        "output": "A list is mutable, meaning you can modify it after creation. A tuple is immutable and cannot be changed after creation..."
    }
]

A more common format is the conversation format:

# Chat/Conversation format
chat_data = [
    {
        "messages": [
            {"role": "system", "content": "You are a helpful AI assistant."},
            {"role": "user", "content": "Explain bubble sort."},
            {"role": "assistant", "content": "Bubble sort is a simple algorithm..."}
        ]
    }
]
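
Under the hood, those messages are flattened into a single training string using the model's chat template, which inserts the special tokens that mark each role. A minimal sketch with Hugging Face transformers (the model name is just an example; any model that ships a chat template behaves the same way):

from transformers import AutoTokenizer

# Example model — any chat model with a built-in chat template works the same way
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain bubble sort."},
    {"role": "assistant", "content": "Bubble sort is a simple algorithm..."},
]

# apply_chat_template turns the messages into the single string the model is trained on
print(tokenizer.apply_chat_template(messages, tokenize=False))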

SFT Specs

  • Data volume: Thousands to hundreds of thousands of examples
  • Cost: Manageable (a few hours to a few days with a good GPU)
  • Result: The model knows how to respond and follow instructions
  • Will you do this? Yes! This is the standard fine-tuning step

When we say "fine-tuning," we usually mean this SFT stage. You're taking a pre-trained model and doing Supervised Fine-Tuning on your own data.
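
In practice this stage is usually run with a library such as Hugging Face TRL. A minimal sketch, assuming a recent TRL version (argument names shift a bit between releases) and reusing the chat-format examples defined above:

from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# Wrap the chat-format examples from above in a Dataset
train_dataset = Dataset.from_list(chat_data)

# SFTTrainer handles tokenization and chat templating internally
trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B",    # Pre-trained base model
    train_dataset=train_dataset,
    args=SFTConfig(output_dir="./sft-model", num_train_epochs=1),
)
trainer.train()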

Stage 3: RLHF — Learn What's Good and What's Bad

RLHF stands for Reinforcement Learning from Human Feedback. In this stage, humans evaluate the model's responses and the model learns to produce responses that humans prefer.

The RLHF Process

# Step 1: Collect human preferences
# The model generates two answers, a human picks the better one

preferences = [
    {
        "prompt": "What's the difference between TCP and UDP?",
        "chosen": "TCP is a connection-oriented protocol that...",    # Better answer
        "rejected": "TCP and UDP are both protocols..."              # Worse answer
    }
]

# Step 2: Train a Reward Model
# A separate model that learns to score responses
reward_model = train_reward_model(preferences)

# Step 3: Optimize the main model with PPO
# The model tries to produce responses that the Reward Model scores highly
# (simplified sketch — a real PPOTrainer setup also needs a config, tokenizer,
#  reference model, and a prompt dataset)
from trl import PPOTrainer

ppo_trainer = PPOTrainer(
    model=sft_model,
    reward_model=reward_model,
    ...
)
ppo_trainer.train()
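
For intuition, "training the Reward Model" usually comes down to a pairwise ranking loss: the model assigns a single score to each response, and the loss pushes the chosen answer's score above the rejected one's. A toy sketch of just that loss (the scores are made-up numbers; in practice they come from a scoring head on top of a language model):

import torch
import torch.nn.functional as F

# Scores the reward model assigns to each answer (toy values)
score_chosen = torch.tensor([2.1])     # human-preferred answer
score_rejected = torch.tensor([0.7])   # rejected answer

# Pairwise ranking loss: -log(sigmoid(chosen - rejected))
# It shrinks as the chosen score climbs above the rejected one
loss = -F.logsigmoid(score_chosen - score_rejected).mean()
print(loss.item())  # ≈ 0.22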

Why RLHF Matters

After SFT, the model can respond but might:

  • Give inappropriate or dangerous answers
  • Hallucinate (generate things that aren't real)
  • Respond too long or too short
  • Use an inappropriate tone

RLHF helps mitigate these problems: it doesn't eliminate them entirely, but it makes the model noticeably safer, better calibrated in length and tone, and more helpful.

RLHF Specs

  • Complexity: Very high — requires a separate Reward Model and PPO
  • Cost: High — both computationally and in human labor for evaluation
  • Will you do this? Probably not directly — but DPO (which we'll cover later) is a simpler alternative

DPO: A Simpler Alternative to RLHF

DPO (Direct Preference Optimization) has the same goal as RLHF but without needing a separate Reward Model. It learns directly from preference data. We'll cover it in detail in Episode 8.

# DPO is much simpler
from trl import DPOTrainer

# Just preference data needed — no Reward Model
trainer = DPOTrainer(
    model=sft_model,
    train_dataset=preference_data,  # chosen + rejected pairs
    ...
)
trainer.train()
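
If you're curious what "learns directly from preference data" means, the heart of the DPO loss fits in a few lines: push up the log-probability of the chosen answer and push down the rejected one, both measured relative to a frozen reference model. A toy sketch with made-up numbers (Episode 8 covers the real thing):

import torch
import torch.nn.functional as F

beta = 0.1  # how strongly the model is pushed toward the preference

# Total log-probabilities of each answer under the policy and a frozen reference model
# (toy values — in practice these are sums of per-token log-probs)
policy_chosen, policy_rejected = torch.tensor([-12.0]), torch.tensor([-15.0])
ref_chosen, ref_rejected = torch.tensor([-13.0]), torch.tensor([-14.0])

# DPO loss: reward raising the chosen answer and lowering the rejected one,
# relative to the reference model — no separate Reward Model needed
margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
loss = -F.logsigmoid(beta * margin).mean()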

Summary of the Three Stages

  • Pre-training: The model learns language (trillions of tokens, millions of dollars) — you don't do this
  • SFT: The model learns to respond (thousands of examples) — this is standard fine-tuning
  • RLHF/DPO: The model learns to respond better (preference data) — optional but useful

Where Do You Fit in This Process?

# Practical fine-tuning path
# (fine_tune and dpo_train below are placeholders for an SFT / DPO training run)
# 1. Pick a pre-trained model
base_model = "meta-llama/Llama-3.1-8B"          # Pre-trained

# 2. Or better: pick an Instruct model (already SFT'd)
instruct_model = "meta-llama/Llama-3.1-8B-Instruct"  # SFT'd

# 3. Fine-tune on your own data (continuing SFT)
my_model = fine_tune(instruct_model, my_data)

# 4. Optional: DPO for quality improvement
my_aligned_model = dpo_train(my_model, preference_data)

Practical tip: You typically start from an Instruct model (like Llama-3.1-8B-Instruct) and run SFT on your own data. This means stages 1 and 2 have already been done for you, and you're repeating stage 2 with your own data.

Comparing Base and Instruct Models

When you look at a model on Hugging Face, you usually see two versions:

  • Base: Only pre-trained (e.g., Llama-3.1-8B)
  • Instruct: Pre-training + SFT + RLHF (e.g., Llama-3.1-8B-Instruct)

# Practical difference between Base and Instruct
from transformers import pipeline

# Base model — just continues text
base = pipeline("text-generation", model="meta-llama/Llama-3.1-8B")
print(base("The capital of France")[0]["generated_text"])
# Output: "The capital of France is Paris. Paris is the largest city..."

# Instruct model — understands the question and answers it
instruct = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")
print(instruct("What is the capital of France?")[0]["generated_text"])
# Output: "The capital of France is Paris."
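
A side note: in recent transformers versions the text-generation pipeline also accepts chat messages directly, which lets the Instruct model's chat template do its job instead of feeding it a raw string:

# Chat-style call (recent transformers versions)
messages = [{"role": "user", "content": "What is the capital of France?"}]
result = instruct(messages)
print(result[0]["generated_text"][-1]["content"])
# "The capital of France is Paris."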

Summary

Now you know the journey a language model takes to become the product you use today. When we say fine-tuning, we're talking about the SFT stage — the point where you can, at a reasonable cost, mold the model to your needs.

In the next episode, we'll explore LoRA — a technique that lets you train only a small portion of the model without changing the whole thing. This dramatically reduces memory consumption.