Dataset Preparation — The Most Critical Part of Fine-Tuning

Episode 5 · 20 min

Why Data Is the Most Important Part

There’s a saying in machine learning: “Garbage in, garbage out.” If your data is bad, your model will be bad too — no matter how powerful your GPU is or what technique you use.

Fine-tuning is like teaching a smart student. If you give them bad textbooks, they’ll learn badly. If you give them good but few textbooks, that’s still better than giving them a pile of bad ones.

Golden rule of datasets: 1,000 excellent examples are better than 100,000 mediocre ones.

Common Dataset Formats

1. Instruction Format

The most common format: an instruction paired with an output.

# Instruction format — each line is a JSON object
{"instruction": "Explain the difference between a list and a tuple.", 
 "output": "A list is mutable, meaning you can modify its elements after creation. A tuple is immutable and cannot be changed after creation."}

{"instruction": "Write a function that calculates factorial.", 
 "output": "def factorial(n):\n    if n <= 1:\n        return 1\n    return n * factorial(n - 1)"}

# With optional input
{"instruction": "Summarize the following text.", 
 "input": "Artificial intelligence is a branch of computer science that...", 
 "output": "Artificial intelligence is a field in computer science for..."}

2. Conversation Format

For chat models. Each example is a complete conversation:

# Chat/Conversation format
{
    "messages": [
        {"role": "system", "content": "You are a Python programming assistant."},
        {"role": "user", "content": "How do I read a CSV file?"},
        {"role": "assistant", "content": "To read a CSV file in Python, use the pandas library:\n\nimport pandas as pd\ndf = pd.read_csv('data.csv')\nprint(df.head())"},
        {"role": "user", "content": "What if the file uses tab delimiters?"},
        {"role": "assistant", "content": "Just change the sep parameter:\n\ndf = pd.read_csv('data.tsv', sep='\\t')"}
    ]
}
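It's worth verifying each conversation is well-formed before training. A minimal check (the rules here are common conventions, not requirements of any particular trainer): an optional leading system message, then strictly alternating user/assistant turns, ending on an assistant turn:

```python
def validate_conversation(messages):
    """Check: optional leading system message, then strictly alternating
    user/assistant turns, ending with an assistant turn."""
    if not messages:
        return False
    turns = messages[1:] if messages[0]["role"] == "system" else messages
    if not turns:
        return False
    expected = "user"
    for msg in turns:
        if msg["role"] != expected:
            return False
        expected = "assistant" if expected == "user" else "user"
    return turns[-1]["role"] == "assistant"

convo = [
    {"role": "system", "content": "You are a Python assistant."},
    {"role": "user", "content": "How do I read a CSV file?"},
    {"role": "assistant", "content": "Use pandas.read_csv."},
]
print(validate_conversation(convo))  # True
```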

3. Completion Format

The simplest format — just raw text:

# Completion format — for text continuation
{"text": "### Question: What's the difference between HTTP and HTTPS?\n\n### Answer: HTTPS is the secure version of HTTP..."}

# Or without special structure
{"text": "In Python programming, a decorator is a function that takes another function..."}

How Much Data Do You Need?

Short answer: it depends. But here's a general guide:

  • Style/tone change: 500-1,000 examples
  • Simple new task: 1,000-3,000 examples
  • Complex task: 3,000-10,000 examples
  • New domain knowledge: 5,000-20,000 examples
  • New language: 10,000+ examples
Before any cleaning, it helps to inspect your dataset's basic statistics:

# Dataset quality analysis script
import json
from collections import Counter

def analyze_dataset(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        data = [json.loads(line) for line in f]
    
    print(f"Number of examples: {len(data)}")
    
    # Check lengths
    lengths = []
    for item in data:
        if "messages" in item:
            total_len = sum(len(m["content"]) for m in item["messages"])
        elif "instruction" in item:
            total_len = len(item["instruction"]) + len(item.get("output", ""))
        else:
            total_len = len(item.get("text", ""))
        lengths.append(total_len)
    
    print(f"Average length: {sum(lengths)/len(lengths):.0f} characters")
    print(f"Shortest: {min(lengths)} characters")
    print(f"Longest: {max(lengths)} characters")
    
    # Check duplicates
    if "instruction" in data[0]:
        instructions = [d["instruction"] for d in data]
        duplicates = len(instructions) - len(set(instructions))
        print(f"Duplicates: {duplicates}")
    
    return data

data = analyze_dataset("my_dataset.jsonl")

Data Cleaning

1. Removing Bad Examples

def clean_dataset(data):
    cleaned = []
    removed = {"empty": 0, "too_short": 0, "too_long": 0, "duplicate": 0}
    seen = set()
    
    for item in data:
        instruction = item.get("instruction", "")
        output = item.get("output", "")
        
        # Remove empty entries
        if not instruction.strip() or not output.strip():
            removed["empty"] += 1
            continue
        
        # Remove very short entries
        if len(output) < 20:
            removed["too_short"] += 1
            continue
        
        # Remove very long entries
        if len(instruction) + len(output) > 8000:
            removed["too_long"] += 1
            continue
        
        # Remove duplicates
        key = instruction.strip().lower()
        if key in seen:
            removed["duplicate"] += 1
            continue
        seen.add(key)
        
        cleaned.append(item)
    
    print(f"Before cleaning: {len(data)}")
    print(f"After cleaning: {len(cleaned)}")
    print(f"Removed: {removed}")
    
    return cleaned

2. Text Normalization

import re

def normalize_text(text):
    """Normalize text for training"""
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text)
    
    # Normalize quotes
    text = text.replace('“', '"').replace('”', '"')
    text = text.replace('‘', "'").replace('’', "'")
    
    return text.strip()

# Apply to dataset
for item in data:
    item["instruction"] = normalize_text(item["instruction"])
    item["output"] = normalize_text(item["output"])

Deduplication: Removing Duplicates

Duplicate data is one of the worst problems. It causes the model to overfit and merely memorize those examples.

from datasketch import MinHash, MinHashLSH

def remove_near_duplicates(data, threshold=0.8):
    """Remove near-duplicate examples using MinHash LSH"""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    unique_data = []
    
    for i, item in enumerate(data):
        text = item.get("instruction", "") + " " + item.get("output", "")
        
        # Build MinHash
        mh = MinHash(num_perm=128)
        for word in text.split():
            mh.update(word.encode('utf-8'))
        
        # Check for duplicates
        result = lsh.query(mh)
        if not result:
            lsh.insert(str(i), mh)
            unique_data.append(item)
    
    print(f"Before: {len(data)} -> After: {len(unique_data)}")
    return unique_data
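If you'd rather avoid the datasketch dependency on a small dataset, an exact Jaccard comparison over word sets captures the same idea in pure Python. It's O(n²), so it's only practical up to a few thousand examples:

```python
def jaccard(a, b):
    """Jaccard similarity between two word sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def remove_near_duplicates_exact(data, threshold=0.8):
    """Keep an example only if it is not too similar to any already-kept one."""
    kept, kept_sets = [], []
    for item in data:
        text = item.get("instruction", "") + " " + item.get("output", "")
        words = set(text.lower().split())
        if all(jaccard(words, s) < threshold for s in kept_sets):
            kept.append(item)
            kept_sets.append(words)
    return kept

data = [
    {"instruction": "What is a list?",
     "output": "Lists in Python are mutable sequences that can be changed."},
    {"instruction": "What is a list?",
     "output": "Lists in Python are mutable sequences that may be changed."},
    {"instruction": "What is HTTP?", "output": "A web protocol."},
]
print(len(remove_near_duplicates_exact(data)))  # 2 (the near-duplicate is dropped)
```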

Data Augmentation: Increasing Your Data

If you have limited data, you can use various techniques to create more:

# Method 1: Question Paraphrasing
def augment_with_llm(instruction, model="gpt-4"):
    """Create different versions of a question using an LLM"""
    prompt = f"""Rewrite the following question in 3 different ways.
    Don't change the meaning, just change the phrasing.
    
    Original question: {instruction}
    
    3 rewrites:"""
    
    # Call the API (call_llm stands in for whatever API client you use)
    response = call_llm(prompt, model=model)
    return parse_paraphrases(response)

# Method 2: Synthetic Data Generation
def generate_synthetic_data(topic, num_samples=100):
    """Generate synthetic question-answer pairs"""
    prompt = f"""For the topic "{topic}", create {num_samples} educational 
    question-answer pairs. Answers should be accurate and complete.
    
    Output format:
    Q: question
    A: answer
    ---"""
    
    response = call_llm(prompt)
    return parse_qa_pairs(response)

# Method 3: Format Variation
def change_format(item):
    """Convert example format"""
    variations = []
    
    # Convert to a yes/no question
    # Caution: also generate some false statements with "No" answers,
    # otherwise the model learns to always answer "yes"
    variations.append({
        "instruction": f"Is the following statement correct? {item['output'][:50]}...",
        "output": "Yes, this statement is correct."
    })
    
    # Convert to sentence completion
    variations.append({
        "instruction": f"Complete the following sentence: {item['output'][:30]}...",
        "output": item['output']
    })
    
    return variations
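The call_llm, parse_paraphrases, and parse_qa_pairs helpers above are left undefined. As an illustration, here is one possible parse_qa_pairs for the Q:/A:/--- layout the prompt requests; actual model output varies, so treat this as a sketch:

```python
def parse_qa_pairs(response):
    """Parse 'Q: ...' / 'A: ...' blocks separated by '---' into records."""
    pairs = []
    for block in response.split("---"):
        question, answer = None, []
        for line in block.strip().splitlines():
            if line.startswith("Q:"):
                question = line[2:].strip()
            elif line.startswith("A:"):
                answer.append(line[2:].strip())
            elif answer:
                # continuation line of a multi-line answer
                answer.append(line.strip())
        if question and answer:
            pairs.append({"instruction": question, "output": " ".join(answer)})
    return pairs

sample = ("Q: What is a dict?\nA: A key-value mapping.\n---\n"
          "Q: What is a set?\nA: An unordered collection\nof unique items.\n---")
print(len(parse_qa_pairs(sample)))  # 2
```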

Converting to Final Format

import json

def convert_to_chat_format(data, system_prompt=""):
    """Convert dataset to conversation format"""
    converted = []
    
    for item in data:
        messages = []
        
        if system_prompt:
            messages.append({"role": "system", "content": system_prompt})
        
        messages.append({"role": "user", "content": item["instruction"]})
        messages.append({"role": "assistant", "content": item["output"]})
        
        converted.append({"messages": messages})
    
    return converted

# Convert and save
chat_data = convert_to_chat_format(
    cleaned_data,
    system_prompt="You are an expert Python programming assistant."
)

# Split into train and validation
from sklearn.model_selection import train_test_split

train_data, val_data = train_test_split(chat_data, test_size=0.1, random_state=42)

# Save
def save_jsonl(data, path):
    with open(path, 'w', encoding='utf-8') as f:
        for item in data:
            f.write(json.dumps(item, ensure_ascii=False) + '\n')

save_jsonl(train_data, "train.jsonl")
save_jsonl(val_data, "val.jsonl")

print(f"Train: {len(train_data)} examples")
print(f"Validation: {len(val_data)} examples")

Dataset Quality Checklist

Before starting fine-tuning, go through this checklist:

  • Are examples free of duplicates?
  • Are answers accurate and error-free?
  • Is the answer style consistent?
  • Is there enough variety in the questions?
  • Have very short or very long examples been removed?
  • Is the format the same across all examples?
  • Has the validation set been separated?
  • Has text normalization been applied?

Remember: 80% of your fine-tuning time should be spent on data preparation. If your data is good, the rest is relatively straightforward.

Summary

Dataset preparation is the most important and time-consuming part of fine-tuning. Focus on quality, not quantity. Clean, normalize, and deduplicate your data. Use tools like Argilla for quality review.

Now that the dataset is ready, in the next episode we'll explore Unsloth — a tool that makes fine-tuning 2x faster with 60% less memory.