How different compression methods (FP32, FP16, INT8, FP8, INT4, FP4) change the trade-off between model size, speed, and quality
After Training (1/4), RLHF (2/4), and Sampling (3/4), we come to Inference Optimization (4/4) – how to make models more efficient.
Quantization enables running 70B models on consumer GPUs. Without this technique, local LLM usage would be impractical for most users.
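To make the size claim concrete, here is a minimal back-of-the-envelope sketch (not from the original post) that computes the weight memory of a 70B-parameter model at each precision listed in the table below. It counts weights only and ignores the KV cache, activations, and runtime overhead, which add to the real footprint.

```python
# Rough weight-memory footprint of a 70B-parameter model at different precisions.
# Assumption: weights only; KV cache and activation memory are not included.
PARAMS = 70e9  # 70 billion parameters

bytes_per_param = {
    "FP32": 4.0,
    "FP16": 2.0,
    "INT8": 1.0,
    "FP8":  1.0,
    "INT4": 0.5,
    "FP4":  0.5,
}

for fmt, nbytes in bytes_per_param.items():
    gigabytes = PARAMS * nbytes / 1e9  # decimal GB
    print(f"{fmt:>5}: {gigabytes:6.0f} GB")
```

At FP32 the weights alone take roughly 280 GB, while at 4-bit precision they shrink to about 35 GB, which is why 70B models become feasible on high-end consumer hardware only after quantization.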
| Quantization | Bits | Size (70B) | Speedup | Quality Loss | Training | Primary Use Case |
|---|---|---|---|---|---|---|