Fig. 1 | Sankey diagram of compute allocation for three scenarios. Flow width shows the share of the total compute budget.

Three Allocation Strategies

Key Insights

1. The Chinchilla Question: The historical wisdom was to scale parameters and training data in equal proportion. Snell et al. (2024) show that with Test-Time Compute you can deviate from this rule.
2. Pre-Training is the Foundation: Without good Pre-Training, Test-Time Compute doesn't help; you need a fundamentally competent model. At minimum, roughly 70% of the budget should go to Pre-Training.
3. Fine-Tuning is Optional: Recent work suggests that In-Context Learning (Few-Shot) is often better than Fine-Tuning, so you can skip Fine-Tuning entirely and put that budget into Pre-Training or Test-Time instead.
4. Test-Time Leverage: Spending ~20% of the budget at Test-Time (Best-of-N, Reasoning, Verification) can match the performance of a ~14× larger model; this is the key finding behind recent o1/o3-style reasoning research. A minimal Best-of-N sketch follows this list.
5. Task-Dependent: The optimal split depends on the task type: Pre-Training heavy for Factual QA, Test-Time heavy for Reasoning, and balanced for Language Generation.
6. Practical Implication: Large companies typically Pre-Train models in batches, run SFT/RLHF once, and then deploy them across tasks with different Test-Time budgets. This gives flexibility without retraining.
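
To make insight 4 concrete, here is a minimal sketch of Best-of-N sampling with a verifier. The functions generate_candidate and score_with_verifier are hypothetical placeholders for a model's sampling call and a verifier/reward model; they stand in for whatever system you actually use.

```python
import random

def generate_candidate(prompt: str) -> str:
    """Placeholder: in practice, sample one answer from the pre-trained model."""
    return f"candidate-{random.randint(0, 999)} for {prompt!r}"

def score_with_verifier(prompt: str, candidate: str) -> float:
    """Placeholder: in practice, a verifier or reward model scores the candidate."""
    return random.random()

def best_of_n(prompt: str, n: int = 8) -> str:
    """Spend test-time compute: sample n candidates and keep the best-scored one.

    Larger n means more test-time compute and, with a good verifier,
    better expected answer quality.
    """
    candidates = [generate_candidate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score_with_verifier(prompt, c))

if __name__ == "__main__":
    print(best_of_n("What is 17 * 24?", n=8))
```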

Mathematical Background

Total Compute Budget = C

Allocation (with α + β ≤ 1):
C_pre = α × C (Pre-Training)
C_ft = β × C (Fine-Tuning)
C_test = (1 − α − β) × C (Test-Time)
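
A small sketch of this split, assuming the shares α and β are chosen by hand; the 70/10/20 example values are illustrative, not taken from the figure.

```python
def allocate(total_compute: float, alpha: float, beta: float) -> dict[str, float]:
    """Split a total compute budget C into Pre-Training, Fine-Tuning, and Test-Time shares."""
    assert 0.0 <= alpha and 0.0 <= beta and alpha + beta <= 1.0, "alpha + beta must not exceed 1"
    return {
        "pre_training": alpha * total_compute,
        "fine_tuning": beta * total_compute,
        "test_time": (1.0 - alpha - beta) * total_compute,
    }

# Example: a 70% / 10% / 20% split of a normalized budget C = 1.0
print(allocate(total_compute=1.0, alpha=0.7, beta=0.1))
```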

Snell et al. Finding:
Optimal Performance ∝ C_pre^0.5 × C_test^0.5,

not Performance ∝ C_pre^1 (as was previously assumed).
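
As an illustration only, and only under these assumed exponents with the Fine-Tuning share held at zero, the sketch below evaluates the proxy for a few splits of the same normalized budget.

```python
def performance_proxy(c_pre: float, c_test: float) -> float:
    """Performance ∝ C_pre^0.5 × C_test^0.5 (constants and task effects omitted)."""
    return (c_pre ** 0.5) * (c_test ** 0.5)

C = 1.0  # normalized total budget; fine-tuning share set to 0 for simplicity
for alpha in (0.99, 0.9, 0.8, 0.7, 0.5):
    c_pre, c_test = alpha * C, (1.0 - alpha) * C
    print(f"pre={alpha:.0%}  test={1 - alpha:.0%}  proxy={performance_proxy(c_pre, c_test):.3f}")
```

Under this simplified proxy, heavily lopsided splits waste budget; the exact optimum depends on the constants omitted here and on the task type, consistent with insights 2 and 5 above.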