In the previous episode, we drew a general map of the AI world — understanding how AI, ML, DL, and LLM relate to each other and why everyone is talking about LLMs now. Now it is time to address a very practical question: what hardware handles all these heavy computations?
CPU vs GPU — The Genius vs the Army of Workers
A CPU (Central Processing Unit) is like a genius mathematician: incredibly smart and able to solve very complex problems, but there are only a few of them (the core count). It works with high precision and speed, but on essentially one thing at a time.
A GPU (Graphics Processing Unit) is like a large army of simple workers. Each one knows only one simple math operation, such as adding two numbers, but there are thousands of them working simultaneously.
If you need to solve 10 differential equations, a CPU is excellent. But if you need to perform ten million simple additions and multiplications at once, a GPU is dramatically faster, often by orders of magnitude. And guess what AI mostly does? Yes: millions of simple, repetitive math operations.
Real Numbers
- A modern CPU has about 8 to 32 cores
- A modern GPU has about 5,000 to 16,000+ CUDA cores
GPU cores are much simpler than CPU cores. But when your task involves millions of similar parallel operations, quantity wins.
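The "army of workers" intuition can be made concrete with a little arithmetic: multiplying two n×n matrices takes roughly n³ multiply-adds, and every output cell can be computed independently of the others. A minimal sketch in plain Python (no GPU required, the function name is ours):

```python
def matmul_flops(n: int) -> int:
    """Multiply-add count for a naive n x n matrix multiply.

    Each of the n*n output cells is a dot product of length n,
    and every cell can be computed independently of the others --
    exactly the kind of work a GPU's thousands of cores excel at.
    """
    return n * n * n  # n^2 output cells, n multiply-adds each

# A single layer of a large language model routinely multiplies
# matrices with thousands of rows and columns:
print(matmul_flops(4096))  # 68,719,476,736 -- tens of billions of ops per matmul
```

At that scale, a handful of fast CPU cores cannot compete with thousands of slow ones running at once.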
VRAM — The Most Important GPU Spec for AI
When buying or renting a GPU for AI work, one number matters most: VRAM (Video RAM).
To run an AI model, the entire model must fit in VRAM. Think of it as a desk: the model is the book you are using, VRAM is the desk size. If the book is bigger than the desk, it does not fit.
Memory requirements (without Quantization):
- 7B-8B parameter model (e.g., Llama 3 8B): ~14-16 GB VRAM
- 13B parameter model: ~26-30 GB VRAM
- 70B parameter model: ~140+ GB VRAM
Rule of thumb: each billion parameters needs about 2 GB VRAM (at FP16 precision).
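The rule of thumb is just parameter count times bytes per parameter. A quick sanity check in Python (the function name is ours, not a library call):

```python
def vram_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Rough VRAM needed to hold a model's weights.

    FP16 stores each parameter in 2 bytes, so a model with B billion
    parameters needs about 2*B GB for the weights alone. Real usage is
    somewhat higher: activations and the KV cache add overhead on top.
    """
    return params_billions * bytes_per_param

print(vram_gb(7))   # 14.0 -- matches the ~14-16 GB figure for 7B models
print(vram_gb(70))  # 140.0 -- matches the ~140+ GB figure for 70B models
```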
Quantization — Running Big Models on Small Hardware
Quantization reduces model precision to shrink its memory footprint. Like reducing image quality from 4K to 1080p — the file gets much smaller but still looks good.
- FP16 (16-bit) — Original quality. 2 bytes per parameter.
- INT8 (8-bit) — Half the size. Nearly unchanged quality.
- INT4 (4-bit) — Quarter the size. Slight quality loss but still very good.
A 7B model quantized to INT4 needs only about 3.5 GB of VRAM for its weights — runnable on a laptop GPU!
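The savings fall straight out of bytes-per-parameter. A small sketch comparing the three precisions (the names and numbers follow the list above):

```python
# Bytes needed to store one parameter at each precision
BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Approximate weight memory in GB at a given precision."""
    return params_billions * BYTES_PER_PARAM[precision]

for precision in ("FP16", "INT8", "INT4"):
    print(f"7B at {precision}: {weight_memory_gb(7, precision)} GB")
# 7B at FP16: 14.0 GB
# 7B at INT8: 7.0 GB
# 7B at INT4: 3.5 GB
```

Note these figures cover the weights only; leave a little headroom for activations and the KV cache.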
Practical GPU Recommendations
Free Start
- Google Colab (Free) — Free T4 GPU (a 16 GB card, ~15 GB usable VRAM)
- Kaggle Notebooks — Similar to Colab
- Free APIs — Groq and Cloudflare Workers AI offer limited free requests
Limited Budget (Personal Computer)
- NVIDIA GPU with 8+ GB VRAM — Can run 7B models with 4-bit quantization
- Apple Silicon (M1/M2/M3/M4) — Unified Memory makes these surprisingly capable. An M2 Pro with 32 GB can run 13B models.
Cloud GPU Rental
- Google Colab Pro — ~$10/month
- RunPod / Vast.ai — Hourly GPU rental, $0.20-$3/hour
- Lambda Labs — Professional A100 and H100 GPUs
Buying a GPU
- RTX 4060 (8GB) — ~$300. Entry level, 7B models with Q4.
- RTX 4070 Ti Super (16GB) — ~$800. 13B models with Q4. Best value.
- RTX 4090 (24GB) — ~$1,600. 30B+ models with Q4.
Golden rule: Always buy the most VRAM your budget allows. You can compensate for processing speed with patience, but insufficient VRAM means the model simply will not run.
Training vs Inference
- Training — Building a model from scratch or fine-tuning it. Very heavy. Usually requires multiple professional GPUs.
- Inference — Using a ready model. Much lighter. Doable with a regular GPU.
Good news: as developers, we mostly do inference. Training from scratch is for big companies like OpenAI and Meta.
Summary
- CPU is a genius but alone — GPU is an army of simple workers operating simultaneously
- AI mostly performs matrix operations that are extremely parallelizable — GPU is up to 100x faster
- VRAM is the most important GPU spec for AI
- Quantization lets you run large models on small hardware
- To start, even a laptop with 8 GB VRAM or an Apple Silicon MacBook is enough