Training Loss Progression

Typical loss curve during LLM pretraining, showing distinct phase transitions

Phase 1 – Early Learning (0-2T tokens):
Steep loss decline. The model learns basic patterns (syntax, common concepts).
Phase 2 – Deep Learning (2-8T tokens):
Gradual loss decline. The model learns complex concepts, abstraction, and generalization.
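
The two-phase shape can be sketched as a power-law decay toward an irreducible loss floor. All constants here (starting loss 5.0, floor 1.8, decay exponent) are illustrative choices, not fitted values:

```python
def training_loss(tokens_T: float, l0: float = 5.0, l_min: float = 1.8,
                  t0: float = 0.5, alpha: float = 0.6) -> float:
    """Illustrative loss curve: starts at l0 and decays as a power law
    toward an irreducible floor l_min (all constants hypothetical)."""
    return l_min + (l0 - l_min) / (1.0 + tokens_T / t0) ** alpha

# Steep decline early (phase 1), gradual decline later (phase 2):
curve = [training_loss(t) for t in (0, 1, 2, 4, 8)]
```

Because the decay is a power law rather than exponential, the loss drop between 0T and 1T tokens is far larger than the drop between 4T and 8T, matching the two phases above.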

Loss Components

The training loss can be decomposed into components that improve at different rates

Short-Range Tokens

Task: Predict next token
Difficulty: Medium
Learning: Fast → plateaus early

Long-Range Dependencies

Task: Use distant context
Difficulty: High
Learning: Slow → later improvement
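
One practical way to watch these components separately is to average per-token loss by position in the context window, since later positions can exploit longer-range context. A minimal sketch; the bucketing scheme and bucket size are assumptions, not a standard API:

```python
from statistics import mean

def loss_by_position(per_token_losses, bucket_size=512):
    """Average per-token losses by position bucket in the context window.

    per_token_losses: one list of per-token losses per training sequence.
    Returns {bucket_index: mean loss}; bucket 0 covers the earliest
    (shortest-context) positions.
    """
    buckets = {}
    for seq in per_token_losses:
        for pos, loss in enumerate(seq):
            buckets.setdefault(pos // bucket_size, []).append(loss)
    return {b: mean(v) for b, v in sorted(buckets.items())}
```

If mean loss barely improves from early to late buckets, the model is likely not yet exploiting long-range context, consistent with that component learning slowly.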
🔑 Key Insights

Scaling Laws

Loss follows a power law in model size: Loss ∝ N^(−α) with α ≈ 0.076 (Kaplan et al.; later refined by the Chinchilla scaling laws)
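
In the parametric form fitted by Hoffmann et al. (2022), loss depends jointly on parameter count N and training tokens D. A sketch using their reported constants (treat the exact numbers as approximate):

```python
def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Parametric scaling law L(N, D) = E + A/N^alpha + B/D^beta,
    with constants as reported by Hoffmann et al. (2022)."""
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
    return E + A / n_params**alpha + B / n_tokens**beta

# e.g. a 70B-parameter model trained on 1.4T tokens (Chinchilla's own scale)
loss = chinchilla_loss(70e9, 1.4e12)
```

The E term is the irreducible loss of natural text; the other two terms shrink as either model size or data grows, which is why scaling only one of them eventually hits diminishing returns.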

Chinchilla Optimum

Optimal ratio: ~20 training tokens per parameter. Scale model size and data roughly equally.
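
Combined with the common approximation C ≈ 6·N·D for training FLOPs, the ~20 tokens-per-parameter rule yields a compute-optimal split. A sketch; both the 6·N·D estimate and the ratio are rules of thumb:

```python
import math

def compute_optimal_split(compute_flops: float,
                          tokens_per_param: float = 20.0):
    """Split a FLOP budget C ~= 6*N*D under the constraint D = r*N."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params

# A budget of ~5.9e23 FLOPs recovers roughly 70B params / 1.4T tokens,
# close to Chinchilla's actual configuration.
n, d = compute_optimal_split(5.88e23)
```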

Phase Transitions

Early Learning (steep) → Deep Learning (gradual). May signal emergent abilities.

Overfitting Risk

Large vocabulary + long sequences. Less data diversity = higher risk.

Validation Loss

Starts rising while training loss still falls. Can serve as an early-stopping criterion.
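
A patience-based early-stopping check on periodic validation evaluations might look like this; the patience value and evaluation cadence are arbitrary choices:

```python
def should_stop(val_losses, patience=3):
    """Stop when validation loss has not improved for `patience`
    consecutive evaluations (a simple early-stopping heuristic)."""
    if len(val_losses) <= patience:
        return False
    best_so_far = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_so_far

should_stop([3.0, 2.5, 2.4, 2.3])               # still improving: False
should_stop([3.0, 2.5, 2.4, 2.41, 2.42, 2.43])  # no new best in 3 evals: True
```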

Downstream Performance

Not directly proportional to training loss. Emergent abilities appear "suddenly".

Typical Training Parameters

Data Volume & Tokens

GPT-3: ~300B tokens
GPT-3.5: ~1T tokens (estimated)
GPT-4: ~13T tokens (estimated)
Claude 3: ~4T tokens (estimated)

Hyperparameters

Learning rate: peak ~3e-4, then cosine decay
Batch size: 1-4M tokens
Warmup: 1-2% of total steps
Weight decay: 0.1
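
These settings combine into the standard warmup-plus-cosine schedule. A sketch: the peak learning rate and warmup fraction mirror the values above, while the 10x-lower floor is a common but not universal choice:

```python
import math

def lr_at(step: int, total_steps: int, peak_lr: float = 3e-4,
          warmup_frac: float = 0.01, min_lr: float = 3e-5) -> float:
    """Linear warmup for warmup_frac of steps, then cosine decay
    from peak_lr down to min_lr."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The rate climbs linearly to the peak during warmup (stabilizing early optimization), then follows a half-cosine down to the floor by the final step.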