Quantization Methods in Detail

[Table: detailed overview of quantization methods by bits, weight size for a 70B model, speedup, quality loss, training requirement, and primary use case.]

Fig. 1 | Trade-off between model size and quality loss for different quantization methods. Larger bubbles indicate higher practical adoption; colors mark whether a method requires Quantization-Aware Training (QAT), supports Post-Training Quantization (PTQ), or targets edge/mobile deployment.
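The 70B weight sizes implied by each bit width follow from simple arithmetic. A minimal back-of-the-envelope sketch in Python, counting weights only (activations and KV cache excluded):

```python
# Approximate weight-only memory footprint of a 70B-parameter model
# at the precisions discussed below.
PARAMS = 70e9

for name, bits in [("FP32", 32), ("FP16/BF16", 16), ("FP8/INT8", 8), ("INT4/FP4", 4)]:
    size_gb = PARAMS * bits / 8 / 1e9  # bits -> bytes -> gigabytes
    print(f"{name:>9}: {size_gb:6.0f} GB")

# FP32: 280 GB, FP16/BF16: 140 GB, FP8/INT8: 70 GB, INT4/FP4: 35 GB
```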
Key Insights

1. FP32 is the baseline: All other methods are compared against full precision. Modern training already uses FP16/BF16 for efficiency.

2. FP16 is practically lossless: At only 2 bytes per value, FP16 reduces memory by 50% without noticeable quality loss. It is the standard for cloud inference.

3. INT8 is the older standard: Post-Training Quantization (PTQ) works without retraining (see the PTQ sketch after this list), but FP8 outperforms INT8 on modern Transformers.

4. FP8 is the modern choice: On newer hardware (NVIDIA H100, TPU v5e), FP8 is the best compromise: 8 bits, better quality than INT8, and barely any overhead.

5. INT4/FP4 are edge-focused: Extreme compression (8× less memory than FP32, 4× less than FP16), but they require calibration and often a LoRA finetune to recover quality (a 4-bit loading sketch follows below). Practical on smartphones, but with quality losses.

6. Quantization-Aware Training beats PTQ: QAT (with retraining) yields better quality but costs training time; in practice, PTQ is used for most deployments because it is fast (a fake-quantization sketch closes the section).
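A minimal sketch of the PTQ idea from insight 3: symmetric per-tensor INT8 quantization in PyTorch, with the scale derived from the weight tensor itself. The tensor shape and scale choice are illustrative assumptions, not tied to any particular library; real PTQ pipelines also calibrate activation ranges on sample data.

```python
import torch

def int8_quantize(w: torch.Tensor):
    """Symmetric per-tensor PTQ: map weights to [-127, 127] with one scale."""
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def int8_dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.float() * scale

# Illustrative weight matrix standing in for one Transformer layer.
w = torch.randn(4096, 4096)
q, scale = int8_quantize(w)
w_hat = int8_dequantize(q, scale)

print("int8 storage:", q.numel() * 1, "bytes vs fp32:", w.numel() * 4, "bytes")
print("mean abs error:", (w - w_hat).abs().mean().item())
```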
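For the 4-bit path from insight 5, weight-only quantization is typically applied at load time. A sketch using the Hugging Face transformers integration with bitsandbytes; the model id is a placeholder, and the exact arguments assume a recent transformers version:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit weights with BF16 compute; no retraining needed, but a LoRA
# finetune is often added on top to recover quality.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",   # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)
```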
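Insight 6 rests on the model seeing quantization error during training. A minimal fake-quantization sketch with a straight-through estimator; the bit width and per-tensor scale are simplifying assumptions:

```python
import torch

def fake_quant(w: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Quantize-dequantize in the forward pass, identity in the backward pass."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.detach().abs().max() / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
    # Straight-through estimator: gradients flow as if no rounding happened.
    return w + (w_q - w).detach()

# During QAT, layers use fake_quant(weight) in the forward pass so the
# optimizer learns weights that survive the later true INT8/INT4 conversion.
w = torch.randn(256, 256, requires_grad=True)
loss = fake_quant(w).pow(2).sum()
loss.backward()
print("gradient flows through rounding:", w.grad.abs().sum().item() > 0)
```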