The Problem with RLHF
In Episode 2 we saw that RLHF (Reinforcement Learning from Human Feedback) is the third stage of model training. But RLHF is very complex:
- You need to train a separate Reward Model
- PPO (the RL algorithm) is unstable and its hyperparameters are hard to tune
- Several models need to be in memory simultaneously: the policy being trained, the frozen reference model, and the Reward Model (plus a value/critic model in standard PPO)
- Training is slow and resource-intensive
DPO was created to achieve the same goal as RLHF through a much simpler method.
What Is DPO?
DPO stands for Direct Preference Optimization. The idea is: instead of training a separate Reward Model and then using RL, learn directly from preference data.
An analogy: RLHF is like hiring a food critic (Reward Model) and then training the chef based on the critic’s opinions (RL). DPO is like telling the chef directly “this dish is better than that one” and letting them figure it out.
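Under the hood, DPO rewrites the RLHF objective as a simple classification-style loss over chosen/rejected pairs, with the frozen reference model folded into the loss itself. Here is a minimal sketch of that loss; the function and tensor names are illustrative, not taken from any specific library:
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss for a batch of preference pairs.
    Each argument is the total log-probability of the chosen or rejected
    response under the trainable policy or the frozen reference model."""
    # Implicit rewards: how much more the policy prefers each response than the reference does
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the chosen reward above the rejected reward (binary logistic loss)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()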
Comparing RLHF and DPO
# RLHF — three stages
# Stage 1: Collect preferences (chosen/rejected)
# Stage 2: Train Reward Model
reward_model = train_reward_model(preference_data)
# Stage 3: Train model with PPO
ppo_trainer = PPOTrainer(model, reward_model, ...)
ppo_trainer.train() # Unstable and slow
# DPO — one stage
# Just: train directly from preferences
dpo_trainer = DPOTrainer(model, train_dataset=preference_data, ...)
dpo_trainer.train() # Simple and stable
DPO Data Format
For DPO, you just need chosen/rejected pairs:
# DPO dataset format
dpo_data = [
    {
        "prompt": "What's the difference between == and === in JavaScript?",
        "chosen": "== performs comparison with type coercion. For example, '5' == 5 returns true. But === performs strict equality comparison and checks both value and type. '5' === 5 returns false. It's recommended to always use ===.",
        "rejected": "== and === are both for comparison. === is better."
    },
    {
        "prompt": "How do I sort a list?",
        "chosen": "In Python you have two main approaches:\n\n1. The sort() method which modifies the original list:\nmy_list = [3, 1, 2]\nmy_list.sort()\n\n2. The sorted() function which returns a new list:\nmy_list = [3, 1, 2]\nnew_list = sorted(my_list)\n\nFor descending order: sort(reverse=True)",
        "rejected": "Use sort."
    },
]
Generating DPO Data
def generate_dpo_pairs(model, prompts, num_responses=4):
    """Generate chosen/rejected pairs"""
    dpo_data = []
    for prompt in prompts:
        # Generate multiple different responses
        responses = []
        for _ in range(num_responses):
            response = generate_response(
                model, prompt,
                temperature=0.9,  # High diversity
                do_sample=True,
            )
            responses.append(response)

        # Human or automated evaluation
        # Simplified here; in practice, humans should choose
        ranked = rank_responses(responses, prompt)

        # Best -> chosen, worst -> rejected
        dpo_data.append({
            "prompt": prompt,
            "chosen": ranked[0],    # Best
            "rejected": ranked[-1], # Worst
        })
    return dpo_data

# Or use a stronger model for evaluation
def rank_with_llm(responses, prompt, judge_model="gpt-4"):
    """Rank responses using an LLM"""
    judge_prompt = f"""
{len(responses)} answers have been given to the question below.
Rank the answers from best to worst.
Question: {prompt}
Answers:
"""
    for i, resp in enumerate(responses):
        judge_prompt += f"\n[{i+1}] {resp}\n"
    judge_prompt += "\nRanking (best first): "

    ranking = call_llm(judge_prompt, model=judge_model)
    return parse_ranking(ranking, responses)
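Whichever judge you use, write the resulting pairs to a JSONL file so the training step below can load them. A small helper; the filename matches the dpo_data.jsonl used later:
import json

def save_dpo_dataset(dpo_data, path="dpo_data.jsonl"):
    """Write one JSON object per line with prompt/chosen/rejected keys."""
    with open(path, "w", encoding="utf-8") as f:
        for pair in dpo_data:
            f.write(json.dumps(pair, ensure_ascii=False) + "\n")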
Implementing DPO with TRL
from unsloth import FastLanguageModel
from trl import DPOTrainer, DPOConfig
from datasets import load_dataset
# 1. Load the SFT'd model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="./my-sft-model",  # Previously SFT'd model
    max_seq_length=2048,
    load_in_4bit=True,
)

# LoRA for DPO
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
)
# 2. Load DPO dataset
dataset = load_dataset("json", data_files="dpo_data.jsonl", split="train")
# 3. Format the data
def format_dpo(example):
    prompt_messages = [
        {"role": "user", "content": example["prompt"]},
    ]
    chosen_messages = prompt_messages + [
        {"role": "assistant", "content": example["chosen"]},
    ]
    rejected_messages = prompt_messages + [
        {"role": "assistant", "content": example["rejected"]},
    ]
    prompt_text = tokenizer.apply_chat_template(
        prompt_messages, tokenize=False, add_generation_prompt=True
    )
    chosen_text = tokenizer.apply_chat_template(chosen_messages, tokenize=False)
    rejected_text = tokenizer.apply_chat_template(rejected_messages, tokenize=False)
    # DPOTrainer concatenates prompt + chosen/rejected itself, so keep only
    # the completion part in the chosen/rejected fields (strip the shared prefix)
    return {
        "prompt": prompt_text,
        "chosen": chosen_text[len(prompt_text):] if chosen_text.startswith(prompt_text) else chosen_text,
        "rejected": rejected_text[len(prompt_text):] if rejected_text.startswith(prompt_text) else rejected_text,
    }
dataset = dataset.map(format_dpo)
# 4. DPO configuration
dpo_config = DPOConfig(
    output_dir="dpo-output",
    num_train_epochs=2,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=5e-6,  # Much lower than SFT
    beta=0.1,            # Key DPO parameter
    warmup_ratio=0.1,
    bf16=True,
    optim="adamw_8bit",
    logging_steps=10,
    save_strategy="epoch",
)
# 5. Start training
trainer = DPOTrainer(
    model=model,
    ref_model=None,  # With a LoRA/PEFT model no separate copy is needed; the base weights (adapters disabled) serve as the reference
    args=dpo_config,
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()
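After training, save the LoRA adapter and tokenizer just as you would after SFT. A minimal sketch using the standard save calls; Unsloth also provides helpers for exporting merged weights if you need a standalone model, see its docs:
# Save the DPO-trained LoRA adapter and tokenizer
model.save_pretrained("dpo-output/final-adapter")
tokenizer.save_pretrained("dpo-output/final-adapter")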
The Beta Parameter
Beta is the most important hyperparameter in DPO. It controls how far the model diverges from the reference model:
# Small beta (0.05) -> model changes more
# Large beta (0.5) -> model is more conservative, changes less
# Typical values:
beta_values = {
    0.05: "Large changes, higher risk",
    0.1:  "Default, good balance",
    0.2:  "Moderate changes",
    0.5:  "Conservative, minimal changes",
}
# Start with 0.1 and increase if the model changes too much
dpo_config = DPOConfig(beta=0.1, ...)
When to Use DPO
- After SFT: When the SFT’d model still gives unwanted responses
- Improving tone: When you want the model to be more polite, precise, or concise
- Reducing hallucination: When the model generates inaccurate information
- Better formatting: When you want more structured output
When Not to Use DPO
- When you haven’t done SFT yet — do SFT first, then DPO
- When you don’t have enough preference data (at least 500 pairs)
- When the main problem is lack of knowledge — DPO doesn’t add new knowledge
DPO vs RLHF: Practical Comparison
- Implementation simplicity: DPO is much simpler — just a regular training loop
- Stability: DPO is more stable — PPO can diverge
- Memory usage: DPO uses less — no Reward Model needed
- Quality: The original DPO paper and several follow-up studies report results comparable to PPO-based RLHF on most alignment tasks
- Speed: DPO is faster — one training phase instead of three
DPO is a good replacement for RLHF in nearly all scenarios; only at very large scale (think training a model like ChatGPT) might a carefully tuned RLHF pipeline still perform slightly better.
The Complete Pipeline: SFT + DPO
# Complete fine-tuning pipeline
# Stage 1: SFT with instruction data
# (Episode 6 — with Unsloth)
sft_model = sft_train(base_model, instruction_data)
# Stage 2: Generate responses from the SFT'd model
# To build the DPO dataset
responses = generate_multiple_responses(sft_model, prompts)
# Stage 3: Evaluate and create preference pairs
preference_data = create_preference_pairs(responses)
# Stage 4: DPO
final_model = dpo_train(sft_model, preference_data)
# Result: a model that is both task-specific (SFT)
# and aligned with preferences (DPO)
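As a quick sanity check on the full pipeline, compare the SFT and DPO models on a few prompts that were not part of training. A sketch reusing the hypothetical generate_response helper from earlier:
# Compare responses before and after DPO
eval_prompts = [
    "How do I reverse a string in Python?",
    "Explain the difference between a list and a tuple.",
]
for prompt in eval_prompts:
    before = generate_response(sft_model, prompt, do_sample=False)
    after = generate_response(final_model, prompt, do_sample=False)
    print(f"PROMPT: {prompt}\nSFT: {before}\nDPO: {after}\n")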
Summary
DPO is a powerful and simple tool for improving model quality after SFT. It doesn’t need a separate Reward Model and directly optimizes the model using preference data (chosen/rejected pairs). If your SFT’d model is 80% there, DPO can help with that remaining 20%.
In the next episode, we’ll explore the specific challenges of fine-tuning for the Persian language — from tokenization issues to choosing the right model.