Training Loss Progression

Typical loss curve during LLM pretraining, showing distinct phase transitions

Phase 1 – Early Learning (0-2T tokens):
Steep loss decline. The model learns basic patterns (syntax, common concepts).
Phase 2 – Deep Learning (2-8T tokens):
Gradual loss decline. The model learns complex concepts, abstraction, and generalization.
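
The two-phase shape can be sketched as a power-law decay toward an irreducible loss floor. All constants here (starting loss 5.0, floor 1.8, decay exponent) are illustrative choices, not fitted values:

```python
def training_loss(tokens_T: float, l0: float = 5.0, l_min: float = 1.8,
                  t0: float = 0.5, alpha: float = 0.6) -> float:
    """Illustrative loss curve: starts at l0 and decays as a power law
    toward an irreducible floor l_min (all constants hypothetical)."""
    return l_min + (l0 - l_min) / (1.0 + tokens_T / t0) ** alpha

# Steep decline early (phase 1), gradual decline later (phase 2):
curve = [training_loss(t) for t in (0, 1, 2, 4, 8)]
```

Because the decay is a power law rather than exponential, the loss drop between 0T and 1T tokens is far larger than the drop between 4T and 8T, matching the two phases above.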

Loss Components

The training loss can be decomposed into components that improve at different rates

Short-Range Tokens

Task: Predict next token
Difficulty: Medium
Learning: Fast → plateaus early

Long-Range Dependencies

Task: Use distant context
Difficulty: High
Learning: Slow → later improvement
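
One practical way to watch these components separately is to average per-token loss by position in the context window, since later positions can exploit longer-range context. A minimal sketch; the bucketing scheme and bucket size are assumptions, not a standard API:

```python
from statistics import mean

def loss_by_position(per_token_losses, bucket_size=512):
    """Average per-token losses by position bucket in the context window.

    per_token_losses: one list of per-token losses per training sequence.
    Returns {bucket_index: mean loss}; bucket 0 covers the earliest
    (shortest-context) positions.
    """
    buckets = {}
    for seq in per_token_losses:
        for pos, loss in enumerate(seq):
            buckets.setdefault(pos // bucket_size, []).append(loss)
    return {b: mean(v) for b, v in sorted(buckets.items())}
```

If mean loss barely improves from early to late buckets, the model is likely not yet exploiting long-range context, consistent with that component learning slowly.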
🔑 Key Insights

Scaling Laws

Loss follows a power law in model size: Loss ∝ N^(−α) with α ≈ 0.076 (Kaplan et al.; later refined by the Chinchilla scaling laws)
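
In the parametric form fitted by Hoffmann et al. (2022), loss depends jointly on parameter count N and training tokens D. A sketch using their reported constants (treat the exact numbers as approximate):

```python
def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Parametric scaling law L(N, D) = E + A/N^alpha + B/D^beta,
    with constants as reported by Hoffmann et al. (2022)."""
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
    return E + A / n_params**alpha + B / n_tokens**beta

# e.g. a 70B-parameter model trained on 1.4T tokens (Chinchilla's own scale)
loss = chinchilla_loss(70e9, 1.4e12)
```

The E term is the irreducible loss of natural text; the other two terms shrink as either model size or data grows, which is why scaling only one of them eventually hits diminishing returns.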

Chinchilla Optimum

Optimal ratio: ~20 training tokens per parameter. Scale model size and data roughly equally.
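
Combined with the common approximation C ≈ 6·N·D for training FLOPs, the ~20 tokens-per-parameter rule yields a compute-optimal split. A sketch; both the 6·N·D estimate and the ratio are rules of thumb:

```python
import math

def compute_optimal_split(compute_flops: float,
                          tokens_per_param: float = 20.0):
    """Split a FLOP budget C ~= 6*N*D under the constraint D = r*N."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params

# A budget of ~5.9e23 FLOPs recovers roughly 70B params / 1.4T tokens,
# close to Chinchilla's actual configuration.
n, d = compute_optimal_split(5.88e23)
```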

Phase Transitions

Early Learning (steep) → Deep Learning (gradual). May signal emergent abilities.

Overfitting Risk

Large vocabulary + long sequences. Less data diversity = higher risk.

Validation Loss

Starts rising while training loss still falls. Can serve as an early-stopping criterion.
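
A patience-based early-stopping check on periodic validation evaluations might look like this; the patience value and evaluation cadence are arbitrary choices:

```python
def should_stop(val_losses, patience=3):
    """Stop when validation loss has not improved for `patience`
    consecutive evaluations (a simple early-stopping heuristic)."""
    if len(val_losses) <= patience:
        return False
    best_so_far = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_so_far

should_stop([3.0, 2.5, 2.4, 2.3])               # still improving: False
should_stop([3.0, 2.5, 2.4, 2.41, 2.42, 2.43])  # no new best in 3 evals: True
```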

Downstream Performance

Not directly proportional to training loss. Emergent abilities appear "suddenly".

Typical Training Parameters

Data Volume & Tokens

GPT-3: ~300B tokens
GPT-3.5: ~1T tokens (estimated)
GPT-4: ~13T tokens (estimated)
Claude 3: ~4T tokens (estimated)

Hyperparameters

Learning rate: peak ~3e-4, then cosine decay
Batch size: 1-4M tokens
Warmup: 1-2% of total steps
Weight decay: 0.1
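
These settings combine into the standard warmup-plus-cosine schedule. A sketch: the peak learning rate and warmup fraction mirror the values above, while the 10x-lower floor is a common but not universal choice:

```python
import math

def lr_at(step: int, total_steps: int, peak_lr: float = 3e-4,
          warmup_frac: float = 0.01, min_lr: float = 3e-5) -> float:
    """Linear warmup for warmup_frac of steps, then cosine decay
    from peak_lr down to min_lr."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The rate climbs linearly to the peak during warmup (stabilizing early optimization), then follows a half-cosine down to the floor by the final step.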