The Problem with Full Fine-Tuning
Say you want to fine-tune an 8-billion-parameter model like Llama 3.1 8B. If you want to train all parameters (full fine-tuning), what do you need?
- Model weights: 8 billion parameters x 2 bytes (FP16) = 16 GB
- Gradients: another 16 GB
- Optimizer states (AdamW): 32 GB (two states, each the size of the weights)
- Activations: 10-20 GB, depending on batch size
- Total: roughly 75-85 GB of VRAM!
That means it's tight even on an A100 80GB. Now what if you want to fine-tune a 70-billion-parameter model? You'd need hundreds of gigabytes of VRAM!
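A rough back-of-the-envelope sketch of that budget (simplified: it assumes FP16 for everything, including both AdamW states, plus a ballpark figure for activations; real frameworks often keep optimizer states in FP32, which pushes the total even higher):
params = 8e9  # 8 billion parameters
fp16 = 2      # bytes per parameter

weights = params * fp16        # 16 GB
gradients = params * fp16      # 16 GB
optimizer = 2 * params * fp16  # 32 GB (AdamW: momentum + variance)
activations = 15e9             # ~10-20 GB, depending on batch size

total_gb = (weights + gradients + optimizer + activations) / 1e9
print(f"~{total_gb:.0f} GB of VRAM")  # ~79 GB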
What Is LoRA?
LoRA stands for Low-Rank Adaptation. The idea is simple yet brilliant:
Instead of modifying all model weights, just add a set of small matrices (adapters) and train only those. The original model weights stay frozen and unchanged.
Simple Analogy
Imagine you have a 500-page book. Instead of rewriting the entire book, you write a small 10-page notebook to be used alongside the original. The original book stays untouched, and the notebook contains only the changes and additional notes.
LoRA Math — Simplified
In neural networks, linear layers have a weight matrix W. For example, a 4096×4096 matrix (about 16.8 million parameters).
LoRA says: instead of modifying W, add a small change ΔW to it. And write this ΔW as the product of two smaller matrices:
# Without LoRA:
# W_new = W_original + ΔW
# ΔW is the same size as W: 4096 x 4096 = 16,777,216 parameters
# With LoRA:
# ΔW = A x B
# A: 4096 x r (e.g., r=16)
# B: r x 4096
# Parameter count: 4096x16 + 16x4096 = 131,072
# Only 0.78% of original parameters!
import torch
import torch.nn as nn
class LoRALayer(nn.Module):
    def __init__(self, in_features, out_features, rank=16, alpha=32):
        super().__init__()
        # Original weights (frozen)
        self.linear = nn.Linear(in_features, out_features, bias=False)
        self.linear.weight.requires_grad = False  # Frozen!

        # LoRA matrices (only these get trained)
        self.lora_A = nn.Parameter(torch.randn(in_features, rank) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(rank, out_features))
        self.scaling = alpha / rank

    def forward(self, x):
        # Original output + LoRA modification
        original = self.linear(x)
        lora_output = (x @ self.lora_A @ self.lora_B) * self.scaling
        return original + lora_output
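A quick sanity check of the layer above, using hypothetical dimensions that match the earlier example. Because lora_B starts at zero, the layer initially behaves exactly like the frozen linear layer, and only the 131,072 adapter parameters are trainable:
layer = LoRALayer(in_features=4096, out_features=4096, rank=16)
x = torch.randn(2, 4096)  # a batch of 2 input vectors
print(layer(x).shape)     # torch.Size([2, 4096])

trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"{trainable:,} trainable / {total:,} total")
# 131,072 trainable / 16,908,288 total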
What Is Low-Rank Decomposition?
When you decompose a large matrix into the product of two smaller matrices, it’s called low-rank decomposition. The number r (rank) determines how small the smaller matrices are.
# Comparing parameter counts
d = 4096 # Model dimension
print("Full Fine-tuning:")
print(f" Parameters: {d * d:,} = {d * d / 1e6:.1f}M")
for r in [4, 8, 16, 32, 64]:
    lora_params = 2 * d * r
    ratio = lora_params / (d * d) * 100
    print(f"LoRA (rank={r}):")
    print(f" Parameters: {lora_params:,} ({ratio:.2f}% of original)")
Output:
Full Fine-tuning:
Parameters: 16,777,216 = 16.8M
LoRA (rank=4):
Parameters: 32,768 (0.20% of original)
LoRA (rank=8):
Parameters: 65,536 (0.39% of original)
LoRA (rank=16):
Parameters: 131,072 (0.78% of original)
LoRA (rank=32):
Parameters: 262,144 (1.56% of original)
LoRA (rank=64):
Parameters: 524,288 (3.12% of original)
Key LoRA Parameters
Rank (r)
Rank determines how large the adapter matrices are: higher rank means more learning capacity, but also more memory (see the size estimate after this list).
- r=8: For simple tasks (like style changes)
- r=16: Good default for most tasks
- r=32-64: For complex tasks (like learning a new language)
- r=128+: Rare and usually unnecessary
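To get a feel for what these ranks mean in practice, here is a small size estimate for LoRA on the attention projections of Llama 3.1 8B. The shapes are assumptions based on the public model config (32 layers; q_proj/o_proj are 4096x4096, while k_proj/v_proj are 4096x1024 because of grouped-query attention):
# Estimated adapter parameter count and FP16 size
# (assumed shapes; adjust the numbers for other models)
layers = 32
projections = [(4096, 4096), (4096, 1024), (4096, 1024), (4096, 4096)]  # q, k, v, o

for r in [8, 16, 32, 64]:
    params = layers * sum(r * (d_in + d_out) for d_in, d_out in projections)
    print(f"rank={r:>2}: {params / 1e6:5.1f}M params ~ {params * 2 / 1e6:.0f} MB in FP16")
Even at rank 64, the adapter stays around 100 MB, which is where the "tens of megabytes" figures later in this post come from.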
Alpha (α)
Alpha is a scaling factor. The LoRA output is multiplied by alpha/rank. Higher alpha means more LoRA influence.
# Relationship between alpha and rank
# scaling = alpha / rank
# Example 1: rank=16, alpha=16 -> scaling=1.0
# Example 2: rank=16, alpha=32 -> scaling=2.0 (more LoRA influence)
# Example 3: rank=16, alpha=8 -> scaling=0.5 (less LoRA influence)
# General rule: set alpha to 2x the rank
lora_config = {
    "r": 16,
    "lora_alpha": 32,  # 2 x rank
}
Target Modules
Which layers should LoRA be applied to? In Transformer models, typically the attention layers:
from peft import LoraConfig
config = LoraConfig(
    r=16,
    lora_alpha=32,
    # Which layers get LoRA
    target_modules=[
        "q_proj",     # Query projection
        "k_proj",     # Key projection
        "v_proj",     # Value projection
        "o_proj",     # Output projection
        "gate_proj",  # MLP gate
        "up_proj",    # MLP up
        "down_proj",  # MLP down
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
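If you're not sure which module names a given model actually exposes, one way to check is to list its linear layers. A small sketch (the names shown are typical of Llama-family models):
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Collect the leaf names of all nn.Linear modules
linear_names = {
    name.split(".")[-1]
    for name, module in model.named_modules()
    if isinstance(module, torch.nn.Linear)
}
print(sorted(linear_names))
# e.g. ['down_proj', 'gate_proj', 'k_proj', 'lm_head', 'o_proj', 'q_proj', 'up_proj', 'v_proj']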
Practical Implementation with PEFT
The PEFT library from Hugging Face makes LoRA very straightforward:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import get_peft_model, LoraConfig

# 1. Load base model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
# 2. Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
# 3. Apply LoRA to model
model = get_peft_model(model, lora_config)
# See how many parameters are being trained
model.print_trainable_parameters()
# Output: trainable params: 13,631,488 || all params: 8,043,847,680
# || trainable%: 0.1695%
That means you’re training only 0.17% of parameters! The rest are frozen.
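From here, the PEFT-wrapped model trains like any other Hugging Face model. A minimal training sketch using the standard Trainer; train_dataset is a placeholder for your own tokenized dataset:
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

training_args = TrainingArguments(
    output_dir="./lora-out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,  # LoRA tolerates a higher learning rate than full fine-tuning
    num_train_epochs=1,
    fp16=True,
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # your tokenized dataset (placeholder)
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()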
Benefits of LoRA
- Less memory: 70-80% less than full fine-tuning
- Faster training: Fewer parameters = faster training
- Original model untouched: You can have multiple different adapters
- Small storage: An adapter is only tens of megabytes (not gigabytes)
- Composable: You can combine or swap different adapters
Merging the Adapter
After training, you can merge the LoRA adapter with the original model. This creates a single model with no extra overhead during inference:
# After training — merge adapter with base model
merged_model = model.merge_and_unload()
# Save the merged model
merged_model.save_pretrained("./my-merged-model")
tokenizer.save_pretrained("./my-merged-model")
# Or save just the adapter (much smaller)
model.save_pretrained("./my-lora-adapter")
# This is only a few tens of megabytes!
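Once merged, the result loads like any ordinary checkpoint, with no PEFT dependency at inference time:
# Load the merged model later; no PEFT needed
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "./my-merged-model",
    torch_dtype=torch.float16,
    device_map="auto",
)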
Multiple Adapters for Multiple Tasks
# You can have multiple different adapters on one base model
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Load the first adapter (for translation)
model = PeftModel.from_pretrained(base_model, "./adapter-translation", adapter_name="translation")

# Load more adapters onto the same model
model.load_adapter("./adapter-summary", adapter_name="summary")
model.load_adapter("./adapter-code", adapter_name="code")

# Switch between them at runtime
model.set_adapter("summary")  # now the model summarizes
model.set_adapter("code")     # now the model writes code

# One base model, three different behaviors!
LoRA vs Full Fine-Tuning
- Full fine-tuning an 8B model: ~75+ GB VRAM, hours of training, final file ~16 GB
- LoRA on an 8B model: ~16-20 GB VRAM, faster, adapter only 50-100 MB
Research has shown that LoRA performs close to full fine-tuning in most tasks. Only in very complex tasks might it be slightly weaker — which can usually be compensated by increasing the rank.
Practical Tips
- Start with rank=16 and alpha=32 — sufficient for most tasks
- If results aren’t good, increase rank (32 or 64)
- Set lora_dropout between 0.0 and 0.1
- If resources allow, include all linear layers in target_modules (see the sketch after this list)
- Use a higher learning rate than for full fine-tuning (e.g., 2e-4 instead of 2e-5)
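Put together, a configuration reflecting these tips might look like this (the "all-linear" shortcut requires a recent PEFT release):
from peft import LoraConfig

config = LoraConfig(
    r=16,
    lora_alpha=32,                # 2 x rank
    target_modules="all-linear",  # apply LoRA to every linear layer
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)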
Summary
LoRA is an outstanding technique that has turned fine-tuning from an expensive, complex task into something practical and accessible. With LoRA, you can fine-tune large models on regular GPUs.
But even with LoRA, very large models (like 70B) still require significant GPU resources. In the next episode, we’ll explore QLoRA — which combines quantization and LoRA so that even 70 billion parameter models can be fine-tuned with a regular graphics card.