Why Data Is the Most Important Part
There’s a saying in machine learning: “Garbage in, garbage out.” If your data is bad, your model will be bad too — no matter how powerful your GPU is or what technique you use.
Fine-tuning is like teaching a smart student. Give them bad textbooks and they will learn badly; a handful of good textbooks still beats a pile of bad ones.
Common Dataset Formats
1. Instruction Format
The most common format. An instruction paired with an output:
# Instruction format — one JSON object per example (wrapped here for readability)
{"instruction": "Explain the difference between a list and a tuple.",
"output": "A list is mutable, meaning you can modify its elements after creation. A tuple is immutable and cannot be changed after creation."}
{"instruction": "Write a function that calculates factorial.",
"output": "def factorial(n):\n if n <= 1:\n return 1\n return n * factorial(n - 1)"}
# With optional input
{"instruction": "Summarize the following text.",
"input": "Artificial intelligence is a branch of computer science that...",
"output": "Artificial intelligence is a field in computer science for..."}
2. Conversation Format
For chat models. Each example is a complete conversation:
# Chat/Conversation format
{
"messages": [
{"role": "system", "content": "You are a Python programming assistant."},
{"role": "user", "content": "How do I read a CSV file?"},
{"role": "assistant", "content": "To read a CSV file in Python, use the pandas library:\n\nimport pandas as pd\ndf = pd.read_csv('data.csv')\nprint(df.head())"},
{"role": "user", "content": "What if the file uses tab delimiters?"},
{"role": "assistant", "content": "Just change the sep parameter:\n\ndf = pd.read_csv('data.tsv', sep='\\t')"}
]
}
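Before training, a message list like this is rendered into the single string the model actually sees, using the model's chat template. A minimal sketch, assuming a Hugging Face tokenizer that ships with a chat template (the checkpoint name is only an example):
# Rendering a conversation with a chat template
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
example = {
    "messages": [
        {"role": "user", "content": "How do I read a CSV file?"},
        {"role": "assistant", "content": "Use pandas: pd.read_csv('data.csv')"},
    ]
}
# Produce the training string exactly as the model expects it
text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
print(text)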
3. Completion Format
The simplest format — just raw text:
# Completion format — for text continuation
{"text": "### Question: What's the difference between HTTP and HTTPS?\n\n### Answer: HTTPS is the secure version of HTTP..."}
# Or without special structure
{"text": "In Python programming, a decorator is a function that takes another function..."}
How Much Data Do You Need?
Short answer: it depends. But here's a general guide:
- Style/tone change: 500-1,000 examples
- Simple new task: 1,000-3,000 examples
- Complex task: 3,000-10,000 examples
- New domain knowledge: 5,000-20,000 examples
- New language: 10,000+ examples
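Whatever your target size, start by measuring what you already have. The script below loads a JSONL file and prints some basic statistics: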
# Dataset quality analysis script
import json
from collections import Counter
def analyze_dataset(file_path):
with open(file_path, 'r', encoding='utf-8') as f:
data = [json.loads(line) for line in f]
print(f"Number of examples: {len(data)}")
# Check lengths
lengths = []
for item in data:
if "messages" in item:
total_len = sum(len(m["content"]) for m in item["messages"])
elif "instruction" in item:
total_len = len(item["instruction"]) + len(item.get("output", ""))
else:
total_len = len(item.get("text", ""))
lengths.append(total_len)
print(f"Average length: {sum(lengths)/len(lengths):.0f} characters")
print(f"Shortest: {min(lengths)} characters")
print(f"Longest: {max(lengths)} characters")
# Check duplicates
if "instruction" in data[0]:
instructions = [d["instruction"] for d in data]
duplicates = len(instructions) - len(set(instructions))
print(f"Duplicates: {duplicates}")
return data
data = analyze_dataset("my_dataset.jsonl")
Data Cleaning
1. Removing Bad Examples
def clean_dataset(data):
cleaned = []
removed = {"empty": 0, "too_short": 0, "too_long": 0, "duplicate": 0}
seen = set()
for item in data:
instruction = item.get("instruction", "")
output = item.get("output", "")
# Remove empty entries
if not instruction.strip() or not output.strip():
removed["empty"] += 1
continue
# Remove very short entries
if len(output) < 20:
removed["too_short"] += 1
continue
# Remove very long entries
if len(instruction) + len(output) > 8000:
removed["too_long"] += 1
continue
# Remove duplicates
key = instruction.strip().lower()
if key in seen:
removed["duplicate"] += 1
continue
seen.add(key)
cleaned.append(item)
print(f"Before cleaning: {len(data)}")
print(f"After cleaning: {len(cleaned)}")
print(f"Removed: {removed}")
return cleaned
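Applied to the data loaded earlier (the variable names follow the previous snippets, and cleaned_data is reused in the conversion step further down):
cleaned_data = clean_dataset(data)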
2. Text Normalization
import re
def normalize_text(text):
"""Normalize text for training"""
# Collapse repeated whitespace (note: this also flattens newlines, so skip it for outputs that contain code)
text = re.sub(r'\s+', ' ', text)
# Normalize quotes
text = text.replace('“', '"').replace('”', '"')
text = text.replace('‘', "'").replace('’', "'")
return text.strip()
# Apply to dataset
for item in data:
item["instruction"] = normalize_text(item["instruction"])
item["output"] = normalize_text(item["output"])
Deduplication: Removing Near-Duplicates
Duplicate data is one of the worst problems: it causes the model to overfit and simply memorize the repeated examples. Exact duplicates were already dropped in clean_dataset; the snippet below also catches near-duplicates using MinHash LSH.
# Requires the datasketch package (pip install datasketch)
from datasketch import MinHash, MinHashLSH
def remove_near_duplicates(data, threshold=0.8):
"""Remove near-duplicate examples using MinHash LSH"""
lsh = MinHashLSH(threshold=threshold, num_perm=128)
unique_data = []
for i, item in enumerate(data):
text = item.get("instruction", "") + " " + item.get("output", "")
# Build MinHash
mh = MinHash(num_perm=128)
for word in text.split():
mh.update(word.encode('utf-8'))
# Check for duplicates
result = lsh.query(mh)
if not result:
lsh.insert(str(i), mh)
unique_data.append(item)
print(f"Before: {len(data)} -> After: {len(unique_data)}")
return unique_data
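Run it after the exact-match cleaning above, again reusing the variable names from the earlier snippets:
cleaned_data = remove_near_duplicates(cleaned_data, threshold=0.8)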
Data Augmentation: Expanding a Small Dataset
If you have limited data, you can use various techniques to create more:
# Method 1: Question Paraphrasing
def augment_with_llm(instruction, model="gpt-4"):
"""Create different versions of a question using an LLM"""
prompt = f"""Rewrite the following question in 3 different ways.
Don't change the meaning, just change the phrasing.
Original question: {instruction}
3 rewrites:"""
# Call the API (call_llm and parse_paraphrases are placeholders; see the sketch after Method 3)
response = call_llm(prompt, model=model)
return parse_paraphrases(response)
# Method 2: Synthetic Data Generation
def generate_synthetic_data(topic, num_samples=100):
"""Generate synthetic question-answer pairs"""
prompt = f"""For the topic "{topic}", create {num_samples} educational
question-answer pairs. Answers should be accurate and complete.
Output format:
Q: question
A: answer
---"""
response = call_llm(prompt)
return parse_qa_pairs(response)
# Method 3: Format Variation
def change_format(item):
"""Convert example format"""
variations = []
# Convert to a yes/no question (caution: also add examples whose answer is "No", or the model learns to always agree)
variations.append({
"instruction": f"Is the following statement correct? {item['output'][:50]}...",
"output": "Yes, this statement is correct."
})
# Convert to sentence completion
variations.append({
"instruction": f"Complete the following sentence: {item['output'][:30]}...",
"output": item['output']
})
return variations
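The helpers call_llm, parse_paraphrases, and parse_qa_pairs used above are placeholders, not library functions. One possible sketch using the openai Python client (version 1.x, API key read from the OPENAI_API_KEY environment variable); swap in whichever provider you actually use:
# Placeholder helpers for the augmentation snippets above
import re
from openai import OpenAI
client = OpenAI()  # picks up OPENAI_API_KEY from the environment
def call_llm(prompt, model="gpt-4"):
    """Send one prompt to a chat model and return the raw reply text."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
def parse_paraphrases(text):
    """One paraphrase per non-empty line, with any leading numbering stripped."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return [re.sub(r"^\d+[.)]\s*", "", line) for line in lines]
def parse_qa_pairs(text):
    """Parse 'Q: ... / A: ...' blocks separated by '---' into instruction/output pairs."""
    pairs = []
    for block in text.split("---"):
        question, answer = None, None
        for line in block.splitlines():
            if line.startswith("Q:"):
                question = line[2:].strip()
            elif line.startswith("A:"):
                answer = line[2:].strip()
        if question and answer:
            pairs.append({"instruction": question, "output": answer})
    return pairs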
Converting to Final Format
import json
def convert_to_chat_format(data, system_prompt=""):
"""Convert dataset to conversation format"""
converted = []
for item in data:
messages = []
if system_prompt:
messages.append({"role": "system", "content": system_prompt})
messages.append({"role": "user", "content": item["instruction"]})
messages.append({"role": "assistant", "content": item["output"]})
converted.append({"messages": messages})
return converted
# Convert and save
chat_data = convert_to_chat_format(
cleaned_data,
system_prompt="You are an expert Python programming assistant."
)
# Split into train and validation
from sklearn.model_selection import train_test_split
train_data, val_data = train_test_split(chat_data, test_size=0.1, random_state=42)
# Save
def save_jsonl(data, path):
with open(path, 'w', encoding='utf-8') as f:
for item in data:
f.write(json.dumps(item, ensure_ascii=False) + '\n')
save_jsonl(train_data, "train.jsonl")
save_jsonl(val_data, "val.jsonl")
print(f"Train: {len(train_data)} examples")
print(f"Validation: {len(val_data)} examples")
Dataset Quality Checklist
Before starting fine-tuning, go through this checklist (a small script that automates some of the checks follows the list):
- Are examples free of duplicates?
- Are answers accurate and error-free?
- Is the answer style consistent?
- Is there enough variety in the questions?
- Have very short or very long examples been removed?
- Is the format the same across all examples?
- Has the validation set been separated?
- Has text normalization been applied?
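Several of these checks can be automated. A minimal sketch, assuming the chat-format train.jsonl produced above:
# Automating part of the checklist
import json
def check_dataset(path):
    """Run a few checklist items against a chat-format JSONL file."""
    with open(path, "r", encoding="utf-8") as f:
        data = [json.loads(line) for line in f]
    # Same format in every example?
    assert all("messages" in item for item in data), "Mixed example formats found"
    # Exact duplicate prompts?
    prompts = [
        next(m["content"] for m in item["messages"] if m["role"] == "user")
        for item in data
    ]
    print(f"Duplicate prompts: {len(prompts) - len(set(prompts))}")
    # Suspiciously short or long examples?
    lengths = [sum(len(m["content"]) for m in item["messages"]) for item in data]
    print(f"Under 50 characters: {sum(l < 50 for l in lengths)}")
    print(f"Over 8,000 characters: {sum(l > 8000 for l in lengths)}")
check_dataset("train.jsonl")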
Remember: 80% of your fine-tuning time should be spent on data preparation. If your data is good, the rest is relatively straightforward.
Summary
Dataset preparation is the most important and time-consuming part of fine-tuning. Focus on quality, not quantity. Clean, normalize, and deduplicate your data. Use tools like Argilla for quality review.
Now that the dataset is ready, in the next episode we'll explore Unsloth — a tool that makes fine-tuning 2x faster with 60% less memory.