The Problem with Full Fine-Tuning
Say you want to fine-tune an 8-billion-parameter model like Llama 3.1 8B. If you want to train all parameters (full fine-tuning), what do you need?
- Model weights: 8 billion parameters x 2 bytes (FP16) = 16 GB
- Gradients: another 16 GB
- Optimizer states (AdamW): 32 GB (two states, each the size of the weights)
- Activations: 10-20 GB, depending on batch size
- Total: roughly 75-85 GB of VRAM!
That means it's tight even on an A100 80GB. Now what if you want to fine-tune a 70-billion-parameter model? You'd need hundreds of gigabytes of VRAM!
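A rough back-of-the-envelope sketch of that budget (simplified: it assumes FP16 for everything, including both AdamW states, plus a ballpark figure for activations; real frameworks often keep optimizer states in FP32, which pushes the total even higher):
params = 8e9  # 8 billion parameters
fp16 = 2      # bytes per parameter

weights = params * fp16        # 16 GB
gradients = params * fp16      # 16 GB
optimizer = 2 * params * fp16  # 32 GB (AdamW: momentum + variance)
activations = 15e9             # ~10-20 GB, depending on batch size

total_gb = (weights + gradients + optimizer + activations) / 1e9
print(f"~{total_gb:.0f} GB of VRAM")  # ~79 GB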
What Is LoRA?
LoRA stands for Low-Rank Adaptation. The idea is simple yet brilliant:
Instead of modifying all model weights, just add a set of small matrices (adapters) and train only those. The original model weights stay frozen and unchanged.
Simple Analogy
Imagine you have a 500-page book. Instead of rewriting the entire book, you write a small 10-page notebook to be used alongside the original. The original book stays untouched, and the notebook contains only the changes and additional notes.
LoRA Math — Simplified
In neural networks, linear layers have a weight matrix W. For example, a 4096×4096 matrix (about 16.8 million parameters).
LoRA says: instead of modifying W, add a small change ΔW to it. And write this ΔW as the product of two smaller matrices:
# Without LoRA:
# W_new = W_original + ΔW
# ΔW is the same size as W: 4096 x 4096 = 16,777,216 parameters
# With LoRA:
# ΔW = A x B
# A: 4096 x r (e.g., r=16)
# B: r x 4096
# Parameter count: 4096x16 + 16x4096 = 131,072
# Only 0.78% of original parameters!
import torch
import torch.nn as nn
class LoRALayer(nn.Module):
    def __init__(self, in_features, out_features, rank=16, alpha=32):
        super().__init__()
        # Original weights (frozen)
        self.linear = nn.Linear(in_features, out_features, bias=False)
        self.linear.weight.requires_grad = False  # Frozen!

        # LoRA matrices (only these get trained)
        self.lora_A = nn.Parameter(torch.randn(in_features, rank) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(rank, out_features))
        self.scaling = alpha / rank

    def forward(self, x):
        # Original output + LoRA modification
        original = self.linear(x)
        lora_output = (x @ self.lora_A @ self.lora_B) * self.scaling
        return original + lora_output
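A quick sanity check of the layer above, using hypothetical dimensions that match the earlier example. Because lora_B starts at zero, the layer initially behaves exactly like the frozen linear layer, and only the 131,072 adapter parameters are trainable:
layer = LoRALayer(in_features=4096, out_features=4096, rank=16)
x = torch.randn(2, 4096)  # a batch of 2 input vectors
print(layer(x).shape)     # torch.Size([2, 4096])

trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"{trainable:,} trainable / {total:,} total")
# 131,072 trainable / 16,908,288 total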
What Is Low-Rank Decomposition?
When you decompose a large matrix into the product of two smaller matrices, it’s called low-rank decomposition. The number r (rank) determines how small the smaller matrices are.
# Comparing parameter counts
d = 4096 # Model dimension
print("Full Fine-tuning:")
print(f" Parameters: {d * d:,} = {d * d / 1e6:.1f}M")
for r in [4, 8, 16, 32, 64]:
    lora_params = 2 * d * r
    ratio = lora_params / (d * d) * 100
    print(f"LoRA (rank={r}):")
    print(f" Parameters: {lora_params:,} ({ratio:.2f}% of original)")
Output:
Full Fine-tuning:
Parameters: 16,777,216 = 16.8M
LoRA (rank=4):
Parameters: 32,768 (0.20% of original)
LoRA (rank=8):
Parameters: 65,536 (0.39% of original)
LoRA (rank=16):
Parameters: 131,072 (0.78% of original)
LoRA (rank=32):
Parameters: 262,144 (1.56% of original)
LoRA (rank=64):
Parameters: 524,288 (3.12% of original)
Key LoRA Parameters
Rank (r)
Rank determines how large the adapter matrices are: higher rank means more learning capacity, but also more memory (see the size estimate after this list).
- r=8: For simple tasks (like style changes)
- r=16: Good default for most tasks
- r=32-64: For complex tasks (like learning a new language)
- r=128+: Rare and usually unnecessary
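To get a feel for what these ranks mean in practice, here is a small size estimate for LoRA on the attention projections of Llama 3.1 8B. The shapes are assumptions based on the public model config (32 layers; q_proj/o_proj are 4096x4096, while k_proj/v_proj are 4096x1024 because of grouped-query attention):
# Estimated adapter parameter count and FP16 size
# (assumed shapes; adjust the numbers for other models)
layers = 32
projections = [(4096, 4096), (4096, 1024), (4096, 1024), (4096, 4096)]  # q, k, v, o

for r in [8, 16, 32, 64]:
    params = layers * sum(r * (d_in + d_out) for d_in, d_out in projections)
    print(f"rank={r:>2}: {params / 1e6:5.1f}M params ~ {params * 2 / 1e6:.0f} MB in FP16")
Even at rank 64, the adapter stays around 100 MB, which is where the "tens of megabytes" figures later in this post come from.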
Alpha (α)
Alpha is a scaling factor. The LoRA output is multiplied by alpha/rank. Higher alpha means more LoRA influence.
# Relationship between alpha and rank
# scaling = alpha / rank
# Example 1: rank=16, alpha=16 -> scaling=1.0
# Example 2: rank=16, alpha=32 -> scaling=2.0 (more LoRA influence)
# Example 3: rank=16, alpha=8 -> scaling=0.5 (less LoRA influence)
# General rule: set alpha to 2x the rank
lora_config = {
    "r": 16,
    "lora_alpha": 32,  # 2 x rank
}
Target Modules
Which layers should LoRA be applied to? In Transformer models, typically the attention layers:
from peft import LoraConfig
config = LoraConfig(
    r=16,
    lora_alpha=32,
    # Which layers get LoRA
    target_modules=[
        "q_proj",     # Query projection
        "k_proj",     # Key projection
        "v_proj",     # Value projection
        "o_proj",     # Output projection
        "gate_proj",  # MLP gate
        "up_proj",    # MLP up
        "down_proj",  # MLP down
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
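If you're not sure which module names a given model actually exposes, one way to check is to list its linear layers. A small sketch (the names shown are typical of Llama-family models):
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Collect the leaf names of all nn.Linear modules
linear_names = {
    name.split(".")[-1]
    for name, module in model.named_modules()
    if isinstance(module, torch.nn.Linear)
}
print(sorted(linear_names))
# e.g. ['down_proj', 'gate_proj', 'k_proj', 'lm_head', 'o_proj', 'q_proj', 'up_proj', 'v_proj']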
Practical Implementation with PEFT
The PEFT library from Hugging Face makes LoRA very straightforward:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import get_peft_model, LoraConfig

# 1. Load base model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
# 2. Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
# 3. Apply LoRA to model
model = get_peft_model(model, lora_config)
# See how many parameters are being trained
model.print_trainable_parameters()
# Output: trainable params: 13,631,488 || all params: 8,043,847,680
# || trainable%: 0.1695%
That means you’re training only 0.17% of parameters! The rest are frozen.
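From here, the PEFT-wrapped model trains like any other Hugging Face model. A minimal training sketch using the standard Trainer; train_dataset is a placeholder for your own tokenized dataset:
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

training_args = TrainingArguments(
    output_dir="./lora-out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,  # LoRA tolerates a higher learning rate than full fine-tuning
    num_train_epochs=1,
    fp16=True,
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # your tokenized dataset (placeholder)
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()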
Benefits of LoRA
- Less memory: 70-80% less than full fine-tuning
- Faster training: Fewer parameters = faster training
- Original model untouched: You can have multiple different adapters
- Small storage: An adapter is only tens of megabytes (not gigabytes)
- Composable: You can combine or swap different adapters
Merging the Adapter
After training, you can merge the LoRA adapter with the original model. This creates a single model with no extra overhead during inference:
# After training — merge adapter with base model
merged_model = model.merge_and_unload()
# Save the merged model
merged_model.save_pretrained("./my-merged-model")
tokenizer.save_pretrained("./my-merged-model")
# Or save just the adapter (much smaller)
model.save_pretrained("./my-lora-adapter")
# This is only a few tens of megabytes!
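Once merged, the result loads like any ordinary checkpoint, with no PEFT dependency at inference time:
# Load the merged model later; no PEFT needed
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "./my-merged-model",
    torch_dtype=torch.float16,
    device_map="auto",
)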
Multiple Adapters for Multiple Tasks
# You can have multiple different adapters on one base model
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Load the first adapter (for translation)
model = PeftModel.from_pretrained(base_model, "./adapter-translation", adapter_name="translation")

# Load more adapters onto the same model
model.load_adapter("./adapter-summary", adapter_name="summary")
model.load_adapter("./adapter-code", adapter_name="code")

# Switch between them at runtime
model.set_adapter("summary")  # now the model summarizes
model.set_adapter("code")     # now the model writes code

# One base model, three different behaviors!
LoRA vs Full Fine-Tuning
- Full fine-tuning an 8B model: ~75+ GB VRAM, hours of training, final file ~16 GB
- LoRA on an 8B model: ~16-20 GB VRAM, faster, adapter only 50-100 MB
Research has shown that LoRA performs close to full fine-tuning in most tasks. Only in very complex tasks might it be slightly weaker — which can usually be compensated by increasing the rank.
Practical Tips
- Start with rank=16 and alpha=32 — sufficient for most tasks
- If results aren’t good, increase rank (32 or 64)
- Set lora_dropout between 0.0 and 0.1
- If resources allow, include all linear layers in target_modules (see the sketch after this list)
- Use a higher learning rate than for full fine-tuning (e.g., 2e-4 instead of 2e-5)
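Put together, a configuration reflecting these tips might look like this (the "all-linear" shortcut requires a recent PEFT release):
from peft import LoraConfig

config = LoraConfig(
    r=16,
    lora_alpha=32,                # 2 x rank
    target_modules="all-linear",  # apply LoRA to every linear layer
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)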
Summary
LoRA is an outstanding technique that has turned fine-tuning from an expensive, complex task into something practical and accessible. With LoRA, you can fine-tune large models on regular GPUs.
But even with LoRA, very large models (like 70B) still require significant GPU resources. In the next episode, we’ll explore QLoRA — which combines quantization and LoRA so that even 70 billion parameter models can be fine-tuned with a regular graphics card.