Fine-Tuning for the Persian Language — Challenges and Solutions

Episode 9 · 22 min

The Persian Language and Language Models

When you work with language models in English, everything is relatively straightforward. Persian, however, brings a set of unique challenges that can waste hours of your time and yield poor results if you’re not aware of them.

Let’s start with the most important challenge: tokenization.

The Persian Tokenization Problem

Language model tokenizers convert text into tokens (small pieces of text). The problem is that most tokenizers are optimized for English and split Persian into many tiny tokens, sometimes falling back to individual UTF-8 bytes.

from transformers import AutoTokenizer

# Compare Persian tokenization across different models
models = {
    "LLaMA 3.1": "meta-llama/Llama-3.1-8B-Instruct",
    "Qwen 2.5": "Qwen/Qwen2.5-7B-Instruct",
    "Gemma 2": "google/gemma-2-9b-it",
}

persian_text = "هوش مصنوعی در سال‌های اخیر پیشرفت چشمگیری داشته است."  # "Artificial intelligence has made remarkable progress in recent years."

for name, model_id in models.items():
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokens = tokenizer.tokenize(persian_text)
    token_ids = tokenizer.encode(persian_text)
    
    print(f"\n{name}:")
    print(f"  Token count: {len(token_ids)}")
    print(f"  Tokens: {tokens[:10]}...")
    print(f"  Token/word ratio: {len(token_ids)/len(persian_text.split()):.1f}")

Why Token Count Matters

  • Higher cost: more tokens mean higher API costs
  • Smaller effective context window: Persian text often consumes 2-3x more tokens, so less content fits
  • Slower training: each Persian example takes more tokens, and therefore more compute, to process
  • Lower quality: when words are broken into individual bytes, the model learns language structure poorly

Important note: models with better tokenizers for Persian (like Qwen) produce better fine-tuning results. Always check Persian tokenization before choosing a model.
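
You can quantify this overhead yourself. A minimal sketch, reusing the models dict and persian_text string from the snippet above, comparing token counts for the same sentence in English and Persian:

english_text = "Artificial intelligence has made remarkable progress in recent years."

for name, model_id in models.items():
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    en_count = len(tokenizer.encode(english_text))
    fa_count = len(tokenizer.encode(persian_text))  # the Persian version of the same sentence
    print(f"{name}: EN={en_count} tokens, FA={fa_count} tokens, overhead={fa_count/en_count:.1f}x")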

Choosing the Right Model for Persian

Not all models are suitable for Persian. Some have seen very little Persian in their pre-training data.

# Quick evaluation of models for Persian
def evaluate_persian_capability(model_name):
    """Check a model's Persian capability"""
    from transformers import pipeline
    
    generator = pipeline("text-generation", model=model_name, device_map="auto")
    
    test_prompts = [
        "پایتخت ایران",                     # "The capital of Iran"
        "جشن نوروز",                        # "Nowruz celebration"
        "برنامه‌نویسی پایتون یعنی",          # "Python programming means"
        "مزایای هوش مصنوعی عبارت‌اند از",     # "The advantages of artificial intelligence include"
    ]
    
    print(f"\nModel: {model_name}")
    for prompt in test_prompts:
        output = generator(prompt, max_new_tokens=50, do_sample=False)
        text = output[0]["generated_text"]
        print(f"  {prompt} -> {text[len(prompt):80]}...")

# Recommended models for Persian (in order of priority)
recommended = [
    "Qwen/Qwen2.5-7B-Instruct",      # Best Persian tokenizer
    "Qwen/Qwen2.5-14B-Instruct",     # Larger and better
    "google/gemma-2-9b-it",           # Good Persian performance
    "meta-llama/Llama-3.1-8B-Instruct",  # Average Persian performance
]
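
To compare candidates before committing to one, you can run the spot check above on each entry. Each call downloads the full model, so this needs substantial disk space and a GPU:

for model_id in recommended:
    evaluate_persian_capability(model_id)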

Why Qwen Is Better for Persian

  • Has a more optimized tokenizer for non-Latin languages
  • Stronger multilingual pre-training data
  • Has shown good performance on RTL languages
  • Produces fewer tokens for the same Persian text

Persian Dataset Resources

# Free Persian dataset sources
persian_datasets = {
    # General datasets
    "Persian Wikipedia": "wikimedia/wikipedia (fa)",
    "CC-100 Persian": "cc100 (fa subset)",
    "OSCAR Persian": "oscar-corpus/OSCAR-2301 (fa)",
    
    # NLP datasets
    "ParsiNLU": "persiannlp/parsinlu",        # NLU tasks
    "FarsTail": "persiannlp/farstail",         # NLI
    "Persian QA": "SajjadAyoubi/persian_qa",   # Question Answering
    "PN-Summary": "hooshvare/pn_summary",      # News summarization
    
    # Conversational datasets
    "Persian Alpaca": "jondurbin/airoboros-gpt4-1.4.1-persian",
}

# Load a Persian dataset
from datasets import load_dataset

# Example: Persian QA
dataset = load_dataset("SajjadAyoubi/persian_qa", split="train")
print(f"Number of examples: {len(dataset)}")
print(f"Sample: {dataset[0]}")

Building a Persian Dataset via Translation

def translate_dataset(english_data, batch_size=10):
    """Translate an English instruction dataset to Persian.

    `call_llm` and `parse_translation` are placeholders for your own
    LLM API call and output-parsing logic.
    """
    persian_data = []
    
    for i in range(0, len(english_data), batch_size):
        batch = english_data[i:i+batch_size]
        
        for item in batch:
            # Translate with LLM
            prompt = f"""Translate the following Q&A pair to Persian.
Keep technical terms in English if commonly used.
Maintain the same level of detail.

Question: {item['instruction']}
Answer: {item['output']}

Persian translation:
Question (Persian):"""
            
            translation = call_llm(prompt)
            parsed = parse_translation(translation)
            
            persian_data.append({
                "instruction": parsed["question"],
                "output": parsed["answer"],
                "original_en": item["instruction"],
            })
    
    return persian_data

# Important: always review quality after translation
# Machine translation may contain errors
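
Before sending anything to human reviewers, cheap automated filters can catch the worst failures. A minimal sketch with two heuristics; the thresholds are arbitrary starting points, and count_persian_chars is defined in the evaluation section below:

def flag_suspect_translations(persian_data, min_ratio=0.5, max_ratio=3.0):
    """Flag pairs whose Persian question looks truncated, bloated, or untranslated."""
    suspects = []
    for item in persian_data:
        fa_q, en_q = item["instruction"], item["original_en"]
        length_ratio = len(fa_q) / max(len(en_q), 1)
        persian_ratio = count_persian_chars(fa_q) / max(len(fa_q), 1)
        # Too short or too long relative to the source, or mostly non-Persian characters
        if not (min_ratio <= length_ratio <= max_ratio) or persian_ratio < 0.4:
            suspects.append(item)
    return suspects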

Managing RTL (Right-to-Left)

Persian is a right-to-left (RTL) language. This doesn’t cause special problems during fine-tuning (the model works with tokens, not display direction), but a few points are worth noting:

# Note 1: Mixing Persian and English (mixed RTL/LTR)
mixed_text = "برای نصب کتابخانه از دستور pip install transformers استفاده کنید."
# ("To install the library, use the pip install transformers command.")
# This text mixes RTL (Persian) and LTR (English) in one sentence
# The model usually handles this well, but generation might have issues
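
Display order is a separate issue from modeling. If mixed text renders out of order in a terminal or log, the third-party arabic-reshaper and python-bidi packages can reorder it; a sketch (keep the stored text in logical order and reorder only when printing):

# pip install arabic-reshaper python-bidi
import arabic_reshaper
from bidi.algorithm import get_display

def for_display(text):
    """Reorder RTL text for environments without bidi rendering (display only)."""
    return get_display(arabic_reshaper.reshape(text))

print(for_display(mixed_text))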

# Note 2: Unicode normalization for Persian
import unicodedata

def normalize_persian_text(text):
    """Complete Persian text normalization"""
    # NFC normalization
    text = unicodedata.normalize('NFC', text)
    
    # Standardize Arabic-Persian characters
    replacements = {
        'ك': 'ک',  # Arabic kaf -> Persian kaf
        'ي': 'ی',  # Arabic ya -> Persian ya
    }
    for old, new in replacements.items():
        text = text.replace(old, new)
    
    return text
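
If you would rather not maintain these rules yourself, the open-source hazm library ships a more thorough Normalizer that also handles spacing and ZWNJ (half-space) insertion; a sketch, worth verifying against the hazm version you install:

# pip install hazm
from hazm import Normalizer

normalizer = Normalizer()
# Fixes character variants, spacing, and missing half-spaces (ZWNJ)
print(normalizer.normalize("اصلاح نويسه ها و نیم فاصله ها"))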

Evaluating a Persian Model

def evaluate_persian_model(model, tokenizer, test_data):
    """Automated spot checks for a Persian model.

    Dimensions like fluency, content accuracy, mixed-language handling,
    and formality still need human scoring; `generate_response` and
    `check_tech_terms` are sketched below.
    """
    results = []

    for item in test_data:
        response = generate_response(model, tokenizer, item["prompt"])

        # Automated checks
        # 1. Is the output actually in Persian?
        persian_ratio = count_persian_chars(response) / max(len(response), 1)

        # 2. Are English technical terms preserved?
        tech_terms_preserved = check_tech_terms(response, item.get("expected_terms", []))

        results.append({"prompt": item["prompt"],
                        "persian_ratio": persian_ratio,
                        "tech_terms_preserved": tech_terms_preserved})

        print(f"Prompt: {item['prompt'][:50]}...")
        print(f"  Persian ratio: {persian_ratio:.0%}")
        print(f"  Technical terms: {'correct' if tech_terms_preserved else 'issues'}")

    return results

def count_persian_chars(text):
    """Count Persian/Arabic-script characters"""
    import re
    # Arabic, Arabic Supplement, and Arabic Presentation Forms A/B blocks
    persian_pattern = re.compile(r'[\u0600-\u06FF\u0750-\u077F\uFB50-\uFDFF\uFE70-\uFEFF]')
    return len(persian_pattern.findall(text))
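
The evaluation above relies on two helpers this episode doesn’t define. Minimal sketches, where the generation settings and the verbatim substring check are illustrative choices:

def generate_response(model, tokenizer, prompt, max_new_tokens=200):
    """Generate a completion for a single prompt."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Decode only the newly generated tokens, not the echoed prompt
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

def check_tech_terms(response, expected_terms):
    """True if every expected technical term appears verbatim in the response."""
    return all(term.lower() in response.lower() for term in expected_terms)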

Common Mistakes

  • Using an unsuitable model: A model that has seen little Persian won’t improve much even with fine-tuning
  • Ignoring normalization: Arabic and Persian characters get mixed up
  • Direct translation: Machine translation without quality review
  • Ignoring zero-width non-joiner: Text without proper ZWNJ looks unnatural in Persian
  • Context window: Forgetting that Persian consumes 2-3x more tokens

The best strategy for Persian fine-tuning: Qwen model + high-quality Persian dataset + careful normalization + human evaluation by a native Persian speaker.

Recommended Process

# Complete Persian fine-tuning process

# 1. Choose model — Qwen 2.5 is recommended
model_name = "Qwen/Qwen2.5-7B-Instruct"

# 2. Prepare dataset
# - Collect high-quality Persian data
# - Unicode normalization
# - Check zero-width non-joiners
# - Remove duplicates

# 3. Configure tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Check Persian tokenization quality
sample = "برنامه‌نویسی پایتون"  # "Python programming"
print(f"Tokens: {tokenizer.tokenize(sample)}")
print(f"Count: {len(tokenizer.encode(sample))}")

# 4. Fine-tune with Unsloth (QLoRA)
# Set max_seq_length higher since Persian uses more tokens
max_seq_length = 4096  # Instead of 2048
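
# A minimal Unsloth setup sketch; the hyperparameters below are
# illustrative starting points, not tuned values (see Unsloth's docs):
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    load_in_4bit=True,  # QLoRA: 4-bit quantized base weights
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,           # LoRA rank
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)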

# 5. Evaluation
# Make sure to have a native Persian-speaking evaluator

Summary

Fine-tuning for Persian has its own unique challenges, but it’s far from impossible. The key points: choose the right model (Qwen), normalize your dataset properly, and always have human evaluation by a native Persian speaker.

In the final episode, we’ll cover the journey from fine-tuning to deployment — how to serve your fine-tuned model and use it in a real-world environment.