How loss develops during pretraining and which phases are traversed
Training loss curves are the heartbeat of LLM training: they show how the model gradually learns to recognize patterns and model language. Their characteristic phases, from rapid initial learning to a gradual plateau, reveal fundamental principles of deep learning.
Training fundamentals (1/4) lay the foundation for RLHF (2/4), Sampling (3/4), and Inference Optimization (4/4).
Loss curves are the primary diagnostic tool during training. They show immediately whether training is converging, stagnating, or unstable, which makes them indispensable for anyone training LLMs or analyzing training runs.
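The quantity plotted on these curves is the next-token cross-entropy. A minimal sketch (the function name and example probabilities are illustrative, not from the source):

```python
import math

def cross_entropy(probs, target_index):
    """Next-token cross-entropy: -log p(correct token).
    This is the quantity plotted on a training loss curve."""
    return -math.log(probs[target_index])

# A confident model (p=0.9 on the correct token) has low loss...
low = cross_entropy([0.05, 0.9, 0.05], 1)
# ...while a uniform model over a 50k-token vocabulary starts near ln(50000).
high = math.log(50_000)
print(round(low, 3), round(high, 1))  # 0.105 10.8
```

The ln(vocabulary size) starting point explains why loss curves for large-vocabulary models begin around 10-11 nats and drop steeply at first.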
Interactive visualization of a typical LLM pretraining with phase transitions
Training loss is an aggregate of many components (different token types and data domains) that are learned at different rates.
Loss follows a power law in parameter count: Loss ∝ N^(−α) with α ≈ 0.076 (Kaplan et al. scaling laws)
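The power law can be evaluated directly. A sketch assuming the Kaplan et al. form L(N) = (Nc/N)^α with their fitted constants (the exact numbers are illustrative):

```python
# Fitted values reported by Kaplan et al. (2020) for non-embedding parameters
ALPHA = 0.076
NC = 8.8e13

def loss(n_params):
    """Predicted loss as a power law in parameter count N."""
    return (NC / n_params) ** ALPHA

# Doubling parameters shrinks loss by a constant factor of 2**-ALPHA,
# independent of scale -- the defining property of a power law.
ratio = loss(2e9) / loss(1e9)
print(round(ratio, 3))  # 0.949
```

A ~5% loss reduction per doubling sounds small, but it compounds over many orders of magnitude of scale.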
Compute-optimal ratio: ~20 tokens per parameter (Chinchilla); scale model size and data in equal proportion.
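The ~20 tokens-per-parameter rule of thumb translates directly into a token budget. A minimal sketch (the helper name is illustrative):

```python
def chinchilla_tokens(n_params, tokens_per_param=20):
    """Compute-optimal token budget per the Chinchilla rule of thumb:
    roughly 20 training tokens per model parameter."""
    return n_params * tokens_per_param

# A 70B-parameter model would want roughly 1.4 trillion tokens.
print(f"{chinchilla_tokens(70e9):.1e}")  # 1.4e+12
```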
Early learning (steep) → deep learning (gradual); the transition is an indication of emergent abilities.
Large vocabulary + long sequences; less data diversity = higher risk.
Validation loss rises later than training loss and can serve as an early-stopping criterion.
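Early stopping on validation loss can be sketched in a few lines (function and parameter names are illustrative, not from any specific framework):

```python
def early_stop(val_losses, patience=3):
    """Return the evaluation index at which to stop: the point where
    validation loss has not improved for `patience` evaluations."""
    best = float("inf")
    since_best = 0
    for step, loss in enumerate(val_losses):
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
        if since_best >= patience:
            return step  # validation loss has plateaued or is rising
    return None  # never triggered

# Validation loss bottoms out at index 2, then rises: stop at index 5.
print(early_stop([2.1, 1.8, 1.6, 1.7, 1.75, 1.9]))  # 5
```

Frameworks implement the same logic as a callback; the key design choice is the patience window, which guards against stopping on noisy single-step upticks.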
Downstream performance is not directly proportional to training loss; emergent abilities appear "suddenly".