The Problem with RLHF
In Episode 2 we saw that RLHF (Reinforcement Learning from Human Feedback) is the third stage of model training. But RLHF is very complex:
- You need to train a separate Reward Model
- PPO (the RL algorithm) is unstable and its hyperparameters are hard to tune
- Several models need to be in memory simultaneously: the policy being trained, the frozen reference model, and the Reward Model (plus a value/critic model in standard PPO)
- Training is slow and resource-intensive
DPO was created to achieve the same goal as RLHF through a much simpler method.
What Is DPO?
DPO stands for Direct Preference Optimization. The idea is: instead of training a separate Reward Model and then using RL, learn directly from preference data.
An analogy: RLHF is like hiring a food critic (Reward Model) and then training the chef based on the critic’s opinions (RL). DPO is like telling the chef directly “this dish is better than that one” and letting them figure it out.
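Under the hood, DPO rewrites the RLHF objective as a simple classification-style loss over chosen/rejected pairs, with the frozen reference model folded into the loss itself. Here is a minimal sketch of that loss; the function and tensor names are illustrative, not taken from any specific library:
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss for a batch of preference pairs.
    Each argument is the total log-probability of the chosen or rejected
    response under the trainable policy or the frozen reference model."""
    # Implicit rewards: how much more the policy prefers each response than the reference does
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the chosen reward above the rejected reward (binary logistic loss)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()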
Comparing RLHF and DPO
# RLHF — three stages
# Stage 1: Collect preferences (chosen/rejected)
# Stage 2: Train Reward Model
reward_model = train_reward_model(preference_data)
# Stage 3: Train model with PPO
ppo_trainer = PPOTrainer(model, reward_model, ...)
ppo_trainer.train() # Unstable and slow
# DPO — one stage
# Just: train directly from preferences
dpo_trainer = DPOTrainer(model, train_dataset=preference_data, ...)
dpo_trainer.train() # Simple and stable
DPO Data Format
For DPO, you just need chosen/rejected pairs:
# DPO dataset format
dpo_data = [
    {
        "prompt": "What's the difference between == and === in JavaScript?",
        "chosen": "== performs comparison with type coercion. For example, '5' == 5 returns true. But === performs strict equality comparison and checks both value and type. '5' === 5 returns false. It's recommended to always use ===.",
        "rejected": "== and === are both for comparison. === is better."
    },
    {
        "prompt": "How do I sort a list?",
        "chosen": "In Python you have two main approaches:\n\n1. The sort() method which modifies the original list:\nmy_list = [3, 1, 2]\nmy_list.sort()\n\n2. The sorted() function which returns a new list:\nmy_list = [3, 1, 2]\nnew_list = sorted(my_list)\n\nFor descending order: sort(reverse=True)",
        "rejected": "Use sort."
    },
]
Generating DPO Data
def generate_dpo_pairs(model, prompts, num_responses=4):
    """Generate chosen/rejected pairs"""
    dpo_data = []
    for prompt in prompts:
        # Generate multiple different responses
        responses = []
        for _ in range(num_responses):
            response = generate_response(
                model, prompt,
                temperature=0.9,  # High diversity
                do_sample=True,
            )
            responses.append(response)

        # Human or automated evaluation
        # Simplified here; in practice, humans should choose
        ranked = rank_responses(responses, prompt)

        # Best -> chosen, worst -> rejected
        dpo_data.append({
            "prompt": prompt,
            "chosen": ranked[0],    # Best
            "rejected": ranked[-1], # Worst
        })
    return dpo_data

# Or use a stronger model for evaluation
def rank_with_llm(responses, prompt, judge_model="gpt-4"):
    """Rank responses using an LLM"""
    judge_prompt = f"""
{len(responses)} answers have been given to the question below.
Rank the answers from best to worst.
Question: {prompt}
Answers:
"""
    for i, resp in enumerate(responses):
        judge_prompt += f"\n[{i+1}] {resp}\n"
    judge_prompt += "\nRanking (best first): "

    ranking = call_llm(judge_prompt, model=judge_model)
    return parse_ranking(ranking, responses)
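Whichever judge you use, write the resulting pairs to a JSONL file so the training step below can load them. A small helper; the filename matches the dpo_data.jsonl used later:
import json

def save_dpo_dataset(dpo_data, path="dpo_data.jsonl"):
    """Write one JSON object per line with prompt/chosen/rejected keys."""
    with open(path, "w", encoding="utf-8") as f:
        for pair in dpo_data:
            f.write(json.dumps(pair, ensure_ascii=False) + "\n")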
Implementing DPO with TRL
from unsloth import FastLanguageModel
from trl import DPOTrainer, DPOConfig
from datasets import load_dataset
# 1. Load the SFT'd model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="./my-sft-model",  # Previously SFT'd model
    max_seq_length=2048,
    load_in_4bit=True,
)

# LoRA for DPO
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
)
# 2. Load DPO dataset
dataset = load_dataset("json", data_files="dpo_data.jsonl", split="train")
# 3. Format the data
def format_dpo(example):
    prompt_messages = [
        {"role": "user", "content": example["prompt"]},
    ]
    chosen_messages = prompt_messages + [
        {"role": "assistant", "content": example["chosen"]},
    ]
    rejected_messages = prompt_messages + [
        {"role": "assistant", "content": example["rejected"]},
    ]
    prompt_text = tokenizer.apply_chat_template(
        prompt_messages, tokenize=False, add_generation_prompt=True
    )
    chosen_text = tokenizer.apply_chat_template(chosen_messages, tokenize=False)
    rejected_text = tokenizer.apply_chat_template(rejected_messages, tokenize=False)
    # DPOTrainer concatenates prompt + chosen/rejected itself, so keep only
    # the completion part in the chosen/rejected fields (strip the shared prefix)
    return {
        "prompt": prompt_text,
        "chosen": chosen_text[len(prompt_text):] if chosen_text.startswith(prompt_text) else chosen_text,
        "rejected": rejected_text[len(prompt_text):] if rejected_text.startswith(prompt_text) else rejected_text,
    }
dataset = dataset.map(format_dpo)
# 4. DPO configuration
dpo_config = DPOConfig(
    output_dir="dpo-output",
    num_train_epochs=2,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=5e-6,  # Much lower than SFT
    beta=0.1,            # Key DPO parameter
    warmup_ratio=0.1,
    bf16=True,
    optim="adamw_8bit",
    logging_steps=10,
    save_strategy="epoch",
)
# 5. Start training
trainer = DPOTrainer(
    model=model,
    ref_model=None,  # With a LoRA/PEFT model no separate copy is needed; the base weights (adapters disabled) serve as the reference
    args=dpo_config,
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()
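After training, save the LoRA adapter and tokenizer just as you would after SFT. A minimal sketch using the standard save calls; Unsloth also provides helpers for exporting merged weights if you need a standalone model, see its docs:
# Save the DPO-trained LoRA adapter and tokenizer
model.save_pretrained("dpo-output/final-adapter")
tokenizer.save_pretrained("dpo-output/final-adapter")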
The Beta Parameter
Beta is the most important hyperparameter in DPO. It controls how far the model diverges from the reference model:
# Small beta (0.05) -> model changes more
# Large beta (0.5) -> model is more conservative, changes less
# Typical values:
beta_values = {
    0.05: "Large changes, higher risk",
    0.1:  "Default, good balance",
    0.2:  "Moderate changes",
    0.5:  "Conservative, minimal changes",
}
# Start with 0.1 and increase if the model changes too much
dpo_config = DPOConfig(beta=0.1, ...)
When to Use DPO
- After SFT: When the SFT’d model still gives unwanted responses
- Improving tone: When you want the model to be more polite, precise, or concise
- Reducing hallucination: When the model generates inaccurate information
- Better formatting: When you want more structured output
When Not to Use DPO
- When you haven’t done SFT yet — do SFT first, then DPO
- When you don’t have enough preference data (at least 500 pairs)
- When the main problem is lack of knowledge — DPO doesn’t add new knowledge
DPO vs RLHF: Practical Comparison
- Implementation simplicity: DPO is much simpler — just a regular training loop
- Stability: DPO is more stable — PPO can diverge
- Memory usage: DPO uses less — no Reward Model needed
- Quality: The original DPO paper and several follow-up studies report results comparable to PPO-based RLHF on most alignment tasks
- Speed: DPO is faster — one training phase instead of three
DPO is a good replacement for RLHF in nearly all scenarios; only at very large scale (think training a model like ChatGPT) might a carefully tuned RLHF pipeline still perform slightly better.
The Complete Pipeline: SFT + DPO
# Complete fine-tuning pipeline
# Stage 1: SFT with instruction data
# (Episode 6 — with Unsloth)
sft_model = sft_train(base_model, instruction_data)
# Stage 2: Generate responses from the SFT'd model
# To build the DPO dataset
responses = generate_multiple_responses(sft_model, prompts)
# Stage 3: Evaluate and create preference pairs
preference_data = create_preference_pairs(responses)
# Stage 4: DPO
final_model = dpo_train(sft_model, preference_data)
# Result: a model that is both task-specific (SFT)
# and aligned with preferences (DPO)
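As a quick sanity check on the full pipeline, compare the SFT and DPO models on a few prompts that were not part of training. A sketch reusing the hypothetical generate_response helper from earlier:
# Compare responses before and after DPO
eval_prompts = [
    "How do I reverse a string in Python?",
    "Explain the difference between a list and a tuple.",
]
for prompt in eval_prompts:
    before = generate_response(sft_model, prompt, do_sample=False)
    after = generate_response(final_model, prompt, do_sample=False)
    print(f"PROMPT: {prompt}\nSFT: {before}\nDPO: {after}\n")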
Summary
DPO is a powerful and simple tool for improving model quality after SFT. It doesn’t need a separate Reward Model and directly optimizes the model using preference data (chosen/rejected pairs). If your SFT’d model is 80% there, DPO can help with that remaining 20%.
In the next episode, we’ll explore the specific challenges of fine-tuning for the Persian language — from tokenization issues to choosing the right model.