The Problem: Big Models, Small GPUs
In the previous episode we saw that LoRA helps a lot, but there’s still a problem. Say you want to fine-tune LLaMA 3.1 70B. Even with LoRA, just loading the model in FP16 takes about 140 gigabytes of memory. No consumer GPU has that much.
That’s where QLoRA comes in — a clever combination of Quantization and LoRA.
What Is Quantization?
Quantization means reducing the precision of numbers. Instead of storing each weight with 16 bits (FP16), store it with 8 bits or even 4 bits. This dramatically reduces memory consumption.
# Memory comparison for a 70B model
model_params = 70e9 # 70 billion parameters
fp32 = model_params * 4 / 1e9 # 4 bytes per param
fp16 = model_params * 2 / 1e9 # 2 bytes per param
int8 = model_params * 1 / 1e9 # 1 byte per param
int4 = model_params * 0.5 / 1e9 # 0.5 bytes per param
print(f"FP32: {fp32:.0f} GB") # 280 GB
print(f"FP16: {fp16:.0f} GB") # 140 GB
print(f"INT8: {int8:.0f} GB") # 70 GB
print(f"INT4: {int4:.0f} GB") # 35 GB
With 4-bit quantization, the weights of the 70-billion-parameter model take only about 35 GB. That fits in a single A100 80GB, or even across two RTX 4090s (training needs extra memory on top of the weights, as we'll see below).
QLoRA: The Best of Both Worlds
QLoRA (Quantized LoRA) uses three key innovations:
1. NF4: A New 4-bit Type
NF4 (NormalFloat 4-bit) is a number format designed specifically for neural network weights. Trained weights are approximately normally distributed, so NF4 places its 16 quantization levels at the quantiles of a normal distribution: more levels land near zero, where the weights are densest, and fewer in the sparse tails. A conceptual sketch follows the configuration below.
import torch
from transformers import BitsAndBytesConfig

# Configure 4-bit quantization with NF4
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit quantization
    bnb_4bit_quant_type="nf4",              # NormalFloat 4-bit
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bfloat16
    bnb_4bit_use_double_quant=True,         # Double Quantization
)
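To make "more precision where weights are dense" concrete, here is a small sketch. It is not the exact NF4 codebook from the paper, just a conceptual approximation: it places 16 levels at evenly spaced quantiles of a standard normal instead of evenly spaced values.

import torch

# uniform 4-bit grid: 16 evenly spaced levels in [-1, 1]
uniform_levels = torch.linspace(-1, 1, 16)

# NF4-style grid (conceptual): levels at normal quantiles, so they
# cluster near zero, where weight values are most common
normal = torch.distributions.Normal(0.0, 1.0)
nf4_like = normal.icdf(torch.linspace(0.02, 0.98, 16))
nf4_like = nf4_like / nf4_like.abs().max()  # normalize to [-1, 1]

print([f"{x:+.2f}" for x in uniform_levels.tolist()])
print([f"{x:+.2f}" for x in nf4_like.tolist()])
# the second grid packs far more levels into the region around 0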
2. Double Quantization
In regular quantization, each block of weights has a scaling constant that’s 32 bits. Double Quantization goes further and quantizes these scaling constants too! Result: even more memory savings.
# Double Quantization: conceptual arithmetic
block = 64  # weights per first-level block (QLoRA paper)

# Regular 4-bit quantization:
#   weights: 4-bit, plus one 32-bit scaling constant per block
overhead_regular = 32 / block  # 0.5 bit per param

# Double Quantization: the scaling constants are themselves quantized
# to 8 bits, with one 32-bit constant per 256 first-level constants
overhead_double = 8 / block + 32 / (block * 256)  # ≈ 0.127 bit per param

saving = overhead_regular - overhead_double  # ≈ 0.373 bit per param
print(f"Savings for a 70B model: {saving * 70e9 / 8 / 1e9:.1f} GB")  # ≈ 3.3 GB
3. Paged Optimizers
When GPU memory runs low, QLoRA uses NVIDIA unified memory to page optimizer states out to system RAM and back, much like an operating system pages to disk. This smooths over the transient memory spikes (for example during gradient checkpointing) that would otherwise crash training with an out-of-memory error.
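You rarely touch this mechanism directly; you just pick a paged optimizer. A minimal sketch of wiring one up by hand, assuming bitsandbytes' PagedAdamW8bit class and an NVIDIA GPU, shown on a toy layer:

import torch
import bitsandbytes as bnb

layer = torch.nn.Linear(128, 128).cuda()  # stand-in for trainable params

# optimizer states live in paged memory that CUDA can evict to CPU RAM
# whenever the GPU runs low, instead of raising an OOM error
optimizer = bnb.optim.PagedAdamW8bit(layer.parameters(), lr=2e-4)

loss = layer(torch.randn(4, 128, device="cuda")).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()

In the training script below, the same thing is enabled through optim="paged_adamw_8bit" in TrainingArguments.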
Practical QLoRA Implementation
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer

# 1. Configure quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
# 2. Load model with 4-bit quantization
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
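# Llama tokenizers ship without a pad token; reuse EOS for padding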
tokenizer.pad_token = tokenizer.eos_token
# 3. Prepare model for k-bit training
model = prepare_model_for_kbit_training(model)
# 4. Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
# 5. Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 41,943,040 || all params: 4,579,280,896
# || trainable%: 0.9160%
# ("all params" shows ~4.6B rather than ~8B because the 4-bit layers are
# stored as packed tensors in which two weights share one byte)
Memory Usage Comparison
# Practical comparison — LLaMA 3.1 8B model
methods = {
    "Full Fine-tuning (FP16)": "~60 GB VRAM",
    "LoRA (FP16)": "~18 GB VRAM",
    "QLoRA (4-bit + LoRA)": "~6 GB VRAM",
}
for method, memory in methods.items():
    print(f"{method}: {memory}")
# LLaMA 3.1 70B model
methods_70b = {
    "Full Fine-tuning (FP16)": "~500 GB VRAM (impossible)",
    "LoRA (FP16)": "~150 GB VRAM (multi-GPU)",
    "QLoRA (4-bit + LoRA)": "~48 GB VRAM (1x A100 80GB!)",
}
The bitsandbytes Library
bitsandbytes is the library that handles quantization. Installation is simple:
# Installation
# pip install bitsandbytes
# Verify installation
import bitsandbytes as bnb
print(bnb.__version__)
# Check CUDA
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_mem / 1e9:.1f} GB")
Important Notes about bitsandbytes
- Only works on NVIDIA GPUs (requires CUDA)
- For Mac with Apple Silicon, use MLX instead
- You need version 0.42+ for best performance
Full Training with QLoRA
from datasets import load_dataset
# 1. Load dataset
dataset = load_dataset("json", data_files="my_data.jsonl", split="train")
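# Each line of my_data.jsonl is assumed to hold one JSON object with the
# keys that format_chat reads below, e.g. (made-up example):
# {"question": "What is QLoRA?", "answer": "QLoRA combines 4-bit quantization with LoRA."}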
# 2. Format data
def format_chat(example):
    messages = [
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": example["question"]},
        {"role": "assistant", "content": example["answer"]},
    ]
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False
    )
    return {"text": text}
dataset = dataset.map(format_chat)
# 3. Training configuration
training_args = TrainingArguments(
    output_dir="./qlora-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size = 16
    learning_rate=2e-4,
    warmup_ratio=0.03,
    logging_steps=10,
    save_strategy="epoch",
    fp16=False,
    bf16=True,                      # bfloat16 for computation
    optim="paged_adamw_8bit",       # paged optimizer
    gradient_checkpointing=True,    # extra memory savings
    max_grad_norm=0.3,
    lr_scheduler_type="cosine",
)
# 4. Start training
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=training_args,
    max_seq_length=2048,
    dataset_text_field="text",
)
trainer.train()
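After training, you normally save just the LoRA adapter, which is a few hundred megabytes, not a full model checkpoint. A minimal sketch, reusing bnb_config from the setup above (the adapter path is only an example):

# save only the adapter weights and the tokenizer
trainer.model.save_pretrained("./qlora-output/adapter")
tokenizer.save_pretrained("./qlora-output/adapter")

# for inference later: reload the quantized base model and attach the adapter
from peft import PeftModel
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "./qlora-output/adapter")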
Suitable GPUs for QLoRA
With QLoRA, fine-tuning is no longer limited to expensive GPUs:
- RTX 3060 12GB: 7-8B models (batch size=1)
- RTX 3090 / 4090 24GB: 7-8B models comfortably, 13B with tuning
- A100 40GB: Models up to 30B
- A100 80GB: 70B models
# Suggest batch size based on GPU memory
def suggest_batch_size(gpu_vram_gb, model_size_b, quantization_bits=4):
    """Suggest batch size and gradient accumulation for a GPU/model pair."""
    # quantized model weights in GB (model_size_b is in billions of params)
    model_memory = model_size_b * quantization_bits / 8
    # memory left for activations, gradients, and optimizer states
    available = gpu_vram_gb - model_memory - 2  # ~2 GB framework overhead
    if available < 2:
        return 1, 8  # batch_size=1, grad_accum=8
    elif available < 6:
        return 2, 4
    elif available < 12:
        return 4, 2
    else:
        return 8, 1

# Example
bs, ga = suggest_batch_size(gpu_vram_gb=24, model_size_b=8)
print(f"Batch size: {bs}, Gradient accumulation: {ga}")
# Output: Batch size: 8, Gradient accumulation: 1
QLoRA on Cloud Services
If you don't have a suitable GPU, you can use cloud services:
- Google Colab (free): T4 16GB — for 7-8B models
- Google Colab Pro: A100 40GB — for models up to 30B
- RunPod / Lambda Labs: Choose your GPU at hourly rates
- Kaggle: Free P100 16GB or T4 16GB
Does Quantization Reduce Quality?
Good question. When you reduce numerical precision, some information is lost. But research has shown:
The original QLoRA paper demonstrated that QLoRA (4-bit) performance is approximately equal to full fine-tuning (16-bit). The difference in most benchmarks is less than 1%.
The reason is that only the frozen base weights are stored in 4 bits, and they're dequantized to bfloat16 on the fly for every forward and backward computation. The LoRA adapters, the gradients, and the optimizer updates all live at higher precision, so the training signal itself never passes through 4-bit arithmetic.
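You can check this split directly on the model built in the implementation section: the quantized base weights are frozen, while only the LoRA parameters require gradients. A small sketch:

# tally parameters by group on the QLoRA model from earlier
frozen, trainable = 0, 0
for name, param in model.named_parameters():
    if param.requires_grad:
        trainable += param.numel()  # LoRA adapters (higher precision)
    else:
        frozen += param.numel()     # quantized base weights
print(f"trainable: {trainable:,} | frozen: {frozen:,}")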
Important Tips
- Use bnb_4bit_compute_dtype=torch.bfloat16 (not float16)
- Enable gradient_checkpointing=True for extra memory savings
- Use optim="paged_adamw_8bit" for the paged optimizer
- If GPU memory is low, reduce batch_size and increase gradient_accumulation
- A quantized model can't be merged directly; it first needs to be dequantized (see the sketch below)
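If you do want a single merged checkpoint, the usual workaround is to reload the base model unquantized and fold the adapter into it. A sketch, assuming the adapter saved earlier (paths are examples):

import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# reload the base model in bf16, without quantization
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
)
model = PeftModel.from_pretrained(base, "./qlora-output/adapter")
merged = model.merge_and_unload()  # folds LoRA weights into the base model
merged.save_pretrained("./merged-model")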
Summary
QLoRA has made the democratization of fine-tuning a reality. Now anyone with a regular GPU can fine-tune large models. The combination of 4-bit quantization with LoRA dramatically reduces memory usage without sacrificing quality.
But the best model and the best tools are useless without good data. In the next episode, we'll explore the most important part of fine-tuning: dataset preparation.