Quantization Methods in Detail

[Table: detailed overview of quantization methods by bits, weight size for a 70B model, speedup, quality loss, training requirement, and primary use case.]

Fig. 1 | Trade-off between model size and quality loss for different quantization methods. Larger bubbles indicate higher practical adoption; colors mark whether a method requires Quantization-Aware Training (QAT), supports Post-Training Quantization (PTQ), or targets edge/mobile deployment.
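The 70B weight sizes implied by each bit width follow from simple arithmetic. A minimal back-of-the-envelope sketch in Python, counting weights only (activations and KV cache excluded):

```python
# Approximate weight-only memory footprint of a 70B-parameter model
# at the precisions discussed below.
PARAMS = 70e9

for name, bits in [("FP32", 32), ("FP16/BF16", 16), ("FP8/INT8", 8), ("INT4/FP4", 4)]:
    size_gb = PARAMS * bits / 8 / 1e9  # bits -> bytes -> gigabytes
    print(f"{name:>9}: {size_gb:6.0f} GB")

# FP32: 280 GB, FP16/BF16: 140 GB, FP8/INT8: 70 GB, INT4/FP4: 35 GB
```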
Key Insights

1. FP32 is the baseline: All other methods are compared against full precision. Modern training already uses FP16/BF16 for efficiency.

2. FP16 is practically lossless: At only 2 bytes per value, FP16 reduces memory by 50% without noticeable quality loss. It is the standard for cloud inference.

3. INT8 is the older standard: Post-Training Quantization (PTQ) works without retraining (see the PTQ sketch after this list), but FP8 outperforms INT8 on modern Transformers.

4. FP8 is the modern choice: On newer hardware (NVIDIA H100, TPU v5e), FP8 is the best compromise: 8 bits, better quality than INT8, and barely any overhead.

5. INT4/FP4 are edge-focused: Extreme compression (8× less memory than FP32, 4× less than FP16), but they require calibration and often a LoRA finetune to recover quality (a 4-bit loading sketch follows below). Practical on smartphones, but with quality losses.

6. Quantization-Aware Training beats PTQ: QAT (with retraining) yields better quality but costs training time; in practice, PTQ is used for most deployments because it is fast (a fake-quantization sketch closes the section).
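A minimal sketch of the PTQ idea from insight 3: symmetric per-tensor INT8 quantization in PyTorch, with the scale derived from the weight tensor itself. The tensor shape and scale choice are illustrative assumptions, not tied to any particular library; real PTQ pipelines also calibrate activation ranges on sample data.

```python
import torch

def int8_quantize(w: torch.Tensor):
    """Symmetric per-tensor PTQ: map weights to [-127, 127] with one scale."""
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def int8_dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.float() * scale

# Illustrative weight matrix standing in for one Transformer layer.
w = torch.randn(4096, 4096)
q, scale = int8_quantize(w)
w_hat = int8_dequantize(q, scale)

print("int8 storage:", q.numel() * 1, "bytes vs fp32:", w.numel() * 4, "bytes")
print("mean abs error:", (w - w_hat).abs().mean().item())
```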
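For the 4-bit path from insight 5, weight-only quantization is typically applied at load time. A sketch using the Hugging Face transformers integration with bitsandbytes; the model id is a placeholder, and the exact arguments assume a recent transformers version:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit weights with BF16 compute; no retraining needed, but a LoRA
# finetune is often added on top to recover quality.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",   # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)
```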
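Insight 6 rests on the model seeing quantization error during training. A minimal fake-quantization sketch with a straight-through estimator; the bit width and per-tensor scale are simplifying assumptions:

```python
import torch

def fake_quant(w: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Quantize-dequantize in the forward pass, identity in the backward pass."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.detach().abs().max() / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
    # Straight-through estimator: gradients flow as if no rounding happened.
    return w + (w_q - w).detach()

# During QAT, layers use fake_quant(weight) in the forward pass so the
# optimizer learns weights that survive the later true INT8/INT4 conversion.
w = torch.randn(256, 256, requires_grad=True)
loss = fake_quant(w).pow(2).sum()
loss.backward()
print("gradient flows through rounding:", w.grad.abs().sum().item() > 0)
```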