In the previous episode, we drew a general map of the AI world — understanding how AI, ML, DL, and LLM relate to each other and why everyone is talking about LLMs now. Now it is time to address a very practical question: what hardware handles all these heavy computations?
CPU vs GPU — The Genius vs the Army of Workers
A CPU (Central Processing Unit) is like a genius mathematician: incredibly smart and able to solve very complex problems, but there are only a few of them (the core count). It works with high precision and speed, but on essentially one thing at a time.
A GPU (Graphics Processing Unit) is like a large army of simple workers. Each one knows only one simple math operation, such as adding two numbers, but there are thousands of them working simultaneously.
If you need to solve 10 differential equations, a CPU is excellent. But if you need to perform ten million simple additions and multiplications at once, a GPU is dramatically faster, often by orders of magnitude. And guess what AI mostly does? Yes: millions of simple, repetitive math operations.
Real Numbers
- A modern CPU has about 8 to 32 cores
- A modern GPU has about 5,000 to 16,000+ CUDA cores
GPU cores are much simpler than CPU cores. But when your task involves millions of similar parallel operations, quantity wins.
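The "army of workers" intuition can be made concrete with a little arithmetic: multiplying two n×n matrices takes roughly n³ multiply-adds, and every output cell can be computed independently of the others. A minimal sketch in plain Python (no GPU required, the function name is ours):

```python
def matmul_flops(n: int) -> int:
    """Multiply-add count for a naive n x n matrix multiply.

    Each of the n*n output cells is a dot product of length n,
    and every cell can be computed independently of the others --
    exactly the kind of work a GPU's thousands of cores excel at.
    """
    return n * n * n  # n^2 output cells, n multiply-adds each

# A single layer of a large language model routinely multiplies
# matrices with thousands of rows and columns:
print(matmul_flops(4096))  # 68,719,476,736 -- tens of billions of ops per matmul
```

At that scale, a handful of fast CPU cores cannot compete with thousands of slow ones running at once.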
VRAM — The Most Important GPU Spec for AI
When buying or renting a GPU for AI work, one number matters most: VRAM (Video RAM).
To run an AI model, the entire model must fit in VRAM. Think of it as a desk: the model is the book you are using, VRAM is the desk size. If the book is bigger than the desk, it does not fit.
Memory requirements (without Quantization):
- 7B-8B parameter model (e.g., Llama 3 8B): ~14-16 GB VRAM
- 13B parameter model: ~26-30 GB VRAM
- 70B parameter model: ~140+ GB VRAM
Rule of thumb: each billion parameters needs about 2 GB VRAM (at FP16 precision).
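The rule of thumb is just parameter count times bytes per parameter. A quick sanity check in Python (the function name is ours, not a library call):

```python
def vram_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Rough VRAM needed to hold a model's weights.

    FP16 stores each parameter in 2 bytes, so a model with B billion
    parameters needs about 2*B GB for the weights alone. Real usage is
    somewhat higher: activations and the KV cache add overhead on top.
    """
    return params_billions * bytes_per_param

print(vram_gb(7))   # 14.0 -- matches the ~14-16 GB figure for 7B models
print(vram_gb(70))  # 140.0 -- matches the ~140+ GB figure for 70B models
```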
Quantization — Running Big Models on Small Hardware
Quantization reduces model precision to shrink its memory footprint. Like reducing image quality from 4K to 1080p — the file gets much smaller but still looks good.
- FP16 (16-bit) — Original quality. 2 bytes per parameter.
- INT8 (8-bit) — Half the size. Nearly unchanged quality.
- INT4 (4-bit) — Quarter the size. Slight quality loss but still very good.
A 7B model quantized to INT4 needs only about 3.5 GB of VRAM for its weights — runnable on a laptop GPU!
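The savings fall straight out of bytes-per-parameter. A small sketch comparing the three precisions (the names and numbers follow the list above):

```python
# Bytes needed to store one parameter at each precision
BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Approximate weight memory in GB at a given precision."""
    return params_billions * BYTES_PER_PARAM[precision]

for precision in ("FP16", "INT8", "INT4"):
    print(f"7B at {precision}: {weight_memory_gb(7, precision)} GB")
# 7B at FP16: 14.0 GB
# 7B at INT8: 7.0 GB
# 7B at INT4: 3.5 GB
```

Note these figures cover the weights only; leave a little headroom for activations and the KV cache.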
Practical GPU Recommendations
Free Start
- Google Colab (Free) — Free T4 GPU (a 16 GB card, ~15 GB usable VRAM)
- Kaggle Notebooks — Similar to Colab
- Free APIs — Groq and Cloudflare Workers AI offer limited free requests
Limited Budget (Personal Computer)
- NVIDIA GPU with 8+ GB VRAM — Can run 7B models with 4-bit quantization
- Apple Silicon (M1/M2/M3/M4) — Unified Memory makes these surprisingly capable. An M2 Pro with 32 GB can run 13B models.
Cloud GPU Rental
- Google Colab Pro — ~$10/month
- RunPod / Vast.ai — Hourly GPU rental, $0.20-$3/hour
- Lambda Labs — Professional A100 and H100 GPUs
Buying a GPU
- RTX 4060 (8GB) — ~$300. Entry level, 7B models with Q4.
- RTX 4070 Ti Super (16GB) — ~$800. 13B models with Q4. Best value.
- RTX 4090 (24GB) — ~$1,600. 30B+ models with Q4.
Golden rule: Always buy the most VRAM your budget allows. You can compensate for processing speed with patience, but insufficient VRAM means the model simply will not run.
Training vs Inference
- Training — Building a model from scratch or fine-tuning it. Very heavy. Usually requires multiple professional GPUs.
- Inference — Using a ready model. Much lighter. Doable with a regular GPU.
Good news: as developers, we mostly do inference. Training from scratch is for big companies like OpenAI and Meta.
Summary
- CPU is a genius but alone — GPU is an army of simple workers operating simultaneously
- AI mostly performs matrix operations that are extremely parallelizable — GPU is up to 100x faster
- VRAM is the most important GPU spec for AI
- Quantization lets you run large models on small hardware
- To start, even a laptop with 8 GB VRAM or an Apple Silicon MacBook is enough