The Problem: Big Models, Small GPUs
In the previous episode we saw that LoRA helps a lot, but there’s still a problem. Say you want to fine-tune LLaMA 3.1 70B. Even with LoRA, just loading the model in FP16 takes about 140 gigabytes of memory. No consumer GPU has that much.
That’s where QLoRA comes in — a clever combination of Quantization and LoRA.
What Is Quantization?
Quantization means reducing the precision of numbers. Instead of storing each weight with 16 bits (FP16), store it with 8 bits or even 4 bits. This dramatically reduces memory consumption.
# Memory comparison for a 70B model
model_params = 70e9 # 70 billion parameters
fp32 = model_params * 4 / 1e9 # 4 bytes per param
fp16 = model_params * 2 / 1e9 # 2 bytes per param
int8 = model_params * 1 / 1e9 # 1 byte per param
int4 = model_params * 0.5 / 1e9 # 0.5 bytes per param
print(f"FP32: {fp32:.0f} GB") # 280 GB
print(f"FP16: {fp16:.0f} GB") # 140 GB
print(f"INT8: {int8:.0f} GB") # 70 GB
print(f"INT4: {int4:.0f} GB") # 35 GB
With 4-bit quantization, the weights of the 70-billion-parameter model take only about 35 GB. That fits in a single A100 80GB, or even across two RTX 4090s (training needs extra memory on top of the weights, as we'll see below).
QLoRA: The Best of Both Worlds
QLoRA (Quantized LoRA) uses three key innovations:
1. NF4: A New 4-bit Type
NF4 (NormalFloat 4-bit) is a number format designed specifically for neural network weights. Trained weights are approximately normally distributed, so NF4 places its 16 quantization levels at the quantiles of a normal distribution: more levels land near zero, where the weights are densest, and fewer in the sparse tails. A conceptual sketch follows the configuration below.
import torch
from transformers import BitsAndBytesConfig

# Configure 4-bit quantization with NF4
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit quantization
    bnb_4bit_quant_type="nf4",              # NormalFloat 4-bit
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bfloat16
    bnb_4bit_use_double_quant=True,         # Double Quantization
)
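To make "more precision where weights are dense" concrete, here is a small sketch. It is not the exact NF4 codebook from the paper, just a conceptual approximation: it places 16 levels at evenly spaced quantiles of a standard normal instead of evenly spaced values.

import torch

# uniform 4-bit grid: 16 evenly spaced levels in [-1, 1]
uniform_levels = torch.linspace(-1, 1, 16)

# NF4-style grid (conceptual): levels at normal quantiles, so they
# cluster near zero, where weight values are most common
normal = torch.distributions.Normal(0.0, 1.0)
nf4_like = normal.icdf(torch.linspace(0.02, 0.98, 16))
nf4_like = nf4_like / nf4_like.abs().max()  # normalize to [-1, 1]

print([f"{x:+.2f}" for x in uniform_levels.tolist()])
print([f"{x:+.2f}" for x in nf4_like.tolist()])
# the second grid packs far more levels into the region around 0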
2. Double Quantization
In regular quantization, each block of weights has a scaling constant that’s 32 bits. Double Quantization goes further and quantizes these scaling constants too! Result: even more memory savings.
# Double Quantization: conceptual arithmetic
block = 64  # weights per first-level block (QLoRA paper)

# Regular 4-bit quantization:
#   weights: 4-bit, plus one 32-bit scaling constant per block
overhead_regular = 32 / block  # 0.5 bit per param

# Double Quantization: the scaling constants are themselves quantized
# to 8 bits, with one 32-bit constant per 256 first-level constants
overhead_double = 8 / block + 32 / (block * 256)  # ≈ 0.127 bit per param

saving = overhead_regular - overhead_double  # ≈ 0.373 bit per param
print(f"Savings for a 70B model: {saving * 70e9 / 8 / 1e9:.1f} GB")  # ≈ 3.3 GB
3. Paged Optimizers
When GPU memory runs low, QLoRA uses NVIDIA unified memory to page optimizer states out to system RAM and back, much like an operating system pages to disk. This smooths over the transient memory spikes (for example during gradient checkpointing) that would otherwise crash training with an out-of-memory error.
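You rarely touch this mechanism directly; you just pick a paged optimizer. A minimal sketch of wiring one up by hand, assuming bitsandbytes' PagedAdamW8bit class and an NVIDIA GPU, shown on a toy layer:

import torch
import bitsandbytes as bnb

layer = torch.nn.Linear(128, 128).cuda()  # stand-in for trainable params

# optimizer states live in paged memory that CUDA can evict to CPU RAM
# whenever the GPU runs low, instead of raising an OOM error
optimizer = bnb.optim.PagedAdamW8bit(layer.parameters(), lr=2e-4)

loss = layer(torch.randn(4, 128, device="cuda")).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()

In the training script below, the same thing is enabled through optim="paged_adamw_8bit" in TrainingArguments.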
Practical QLoRA Implementation
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer

# 1. Configure quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
# 2. Load model with 4-bit quantization
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
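# Llama tokenizers ship without a pad token; reuse EOS for padding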
tokenizer.pad_token = tokenizer.eos_token
# 3. Prepare model for k-bit training
model = prepare_model_for_kbit_training(model)
# 4. Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
# 5. Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 41,943,040 || all params: 4,579,280,896
# || trainable%: 0.9160%
# ("all params" shows ~4.6B rather than ~8B because the 4-bit layers are
# stored as packed tensors in which two weights share one byte)
Memory Usage Comparison
# Practical comparison — LLaMA 3.1 8B model
methods = {
    "Full Fine-tuning (FP16)": "~60 GB VRAM",
    "LoRA (FP16)": "~18 GB VRAM",
    "QLoRA (4-bit + LoRA)": "~6 GB VRAM",
}
for method, memory in methods.items():
    print(f"{method}: {memory}")
# LLaMA 3.1 70B model
methods_70b = {
    "Full Fine-tuning (FP16)": "~500 GB VRAM (impossible)",
    "LoRA (FP16)": "~150 GB VRAM (multi-GPU)",
    "QLoRA (4-bit + LoRA)": "~48 GB VRAM (1x A100 80GB!)",
}
The bitsandbytes Library
bitsandbytes is the library that handles quantization. Installation is simple:
# Installation
# pip install bitsandbytes
# Verify installation
import bitsandbytes as bnb
print(bnb.__version__)
# Check CUDA
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_mem / 1e9:.1f} GB")
Important Notes about bitsandbytes
- Only works on NVIDIA GPUs (requires CUDA)
- For Mac with Apple Silicon, use MLX instead
- You need version 0.42+ for best performance
Full Training with QLoRA
from datasets import load_dataset
# 1. Load dataset
dataset = load_dataset("json", data_files="my_data.jsonl", split="train")
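# Each line of my_data.jsonl is assumed to hold one JSON object with the
# keys that format_chat reads below, e.g. (made-up example):
# {"question": "What is QLoRA?", "answer": "QLoRA combines 4-bit quantization with LoRA."}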
# 2. Format data
def format_chat(example):
    messages = [
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": example["question"]},
        {"role": "assistant", "content": example["answer"]},
    ]
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False
    )
    return {"text": text}
dataset = dataset.map(format_chat)
# 3. Training configuration
training_args = TrainingArguments(
    output_dir="./qlora-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size = 16
    learning_rate=2e-4,
    warmup_ratio=0.03,
    logging_steps=10,
    save_strategy="epoch",
    fp16=False,
    bf16=True,                      # bfloat16 for computation
    optim="paged_adamw_8bit",       # paged optimizer
    gradient_checkpointing=True,    # extra memory savings
    max_grad_norm=0.3,
    lr_scheduler_type="cosine",
)
# 4. Start training
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=training_args,
    max_seq_length=2048,
    dataset_text_field="text",
)
trainer.train()
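After training, you normally save just the LoRA adapter, which is a few hundred megabytes, not a full model checkpoint. A minimal sketch, reusing bnb_config from the setup above (the adapter path is only an example):

# save only the adapter weights and the tokenizer
trainer.model.save_pretrained("./qlora-output/adapter")
tokenizer.save_pretrained("./qlora-output/adapter")

# for inference later: reload the quantized base model and attach the adapter
from peft import PeftModel
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "./qlora-output/adapter")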
Suitable GPUs for QLoRA
With QLoRA, fine-tuning is no longer limited to expensive GPUs:
- RTX 3060 12GB: 7-8B models (batch size=1)
- RTX 3090 / 4090 24GB: 7-8B models comfortably, 13B with tuning
- A100 40GB: Models up to 30B
- A100 80GB: 70B models
# Suggest batch size based on GPU memory
def suggest_batch_size(gpu_vram_gb, model_size_b, quantization_bits=4):
    """Suggest batch size and gradient accumulation for a GPU/model pair."""
    # quantized model weights in GB (model_size_b is in billions of params)
    model_memory = model_size_b * quantization_bits / 8
    # memory left for activations, gradients, and optimizer states
    available = gpu_vram_gb - model_memory - 2  # ~2 GB framework overhead
    if available < 2:
        return 1, 8  # batch_size=1, grad_accum=8
    elif available < 6:
        return 2, 4
    elif available < 12:
        return 4, 2
    else:
        return 8, 1

# Example
bs, ga = suggest_batch_size(gpu_vram_gb=24, model_size_b=8)
print(f"Batch size: {bs}, Gradient accumulation: {ga}")
# Output: Batch size: 8, Gradient accumulation: 1
QLoRA on Cloud Services
If you don't have a suitable GPU, you can use cloud services:
- Google Colab (free): T4 16GB — for 7-8B models
- Google Colab Pro: A100 40GB — for models up to 30B
- RunPod / Lambda Labs: Choose your GPU at hourly rates
- Kaggle: Free P100 16GB or T4 16GB
Does Quantization Reduce Quality?
Good question. When you reduce numerical precision, some information is lost. But research has shown:
The original QLoRA paper demonstrated that QLoRA (4-bit) performance is approximately equal to full fine-tuning (16-bit). The difference in most benchmarks is less than 1%.
The reason is that only the frozen base weights are stored in 4 bits, and they're dequantized to bfloat16 on the fly for every forward and backward computation. The LoRA adapters, the gradients, and the optimizer updates all live at higher precision, so the training signal itself never passes through 4-bit arithmetic.
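You can check this split directly on the model built in the implementation section: the quantized base weights are frozen, while only the LoRA parameters require gradients. A small sketch:

# tally parameters by group on the QLoRA model from earlier
frozen, trainable = 0, 0
for name, param in model.named_parameters():
    if param.requires_grad:
        trainable += param.numel()  # LoRA adapters (higher precision)
    else:
        frozen += param.numel()     # quantized base weights
print(f"trainable: {trainable:,} | frozen: {frozen:,}")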
Important Tips
- Use bnb_4bit_compute_dtype=torch.bfloat16 (not float16)
- Enable gradient_checkpointing=True for extra memory savings
- Use optim="paged_adamw_8bit" for the paged optimizer
- If GPU memory is low, reduce batch_size and increase gradient_accumulation
- A quantized model can't be merged directly; it first needs to be dequantized (see the sketch below)
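If you do want a single merged checkpoint, the usual workaround is to reload the base model unquantized and fold the adapter into it. A sketch, assuming the adapter saved earlier (paths are examples):

import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# reload the base model in bf16, without quantization
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
)
model = PeftModel.from_pretrained(base, "./qlora-output/adapter")
merged = model.merge_and_unload()  # folds LoRA weights into the base model
merged.save_pretrained("./merged-model")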
Summary
QLoRA has made the democratization of fine-tuning a reality. Now anyone with a regular GPU can fine-tune large models. The combination of 4-bit quantization with LoRA dramatically reduces memory usage without sacrificing quality.
But the best model and the best tools are useless without good data. In the next episode, we'll explore the most important part of fine-tuning: dataset preparation.