Why Chain-of-Thought only works with larger models: A critical threshold at ~100 billion parameters
Emergence of CoT: Chain-of-Thought prompting shows a critical threshold: only at roughly 100B parameters and above does CoT outperform direct-answer prompting. Smaller models benefit little or even perform worse.
This complements the CoT demo with an important insight: not every model benefits from CoT.
For small models (<10B parameters), CoT can actually decrease performance: the generated reasoning chain contains errors that propagate into the final answer. The choice of prompting strategy therefore depends on model size (see the sketch after the table).
The figures are from the paper "Emergent Abilities of Large Language Models" (Wei et al., 2022). The table shows accuracy with and without CoT prompting across model sizes.
| Model & size | MATH accuracy (without CoT) | MATH accuracy (with CoT) | CoT gain | CoT effective? |
|---|---|---|---|---|
| PaLM 8B | 2% | 2% | +0% | ❌ No |
| PaLM 62B | 4% | 4% | +0% | ❌ No |
| PaLM 540B | 8% | 56% | +48% | ✅ Yes! |
| GPT-3 175B | 17% | 71% | +54% | ✅ Yes! |
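As a rough illustration of the point above, here is a minimal Python sketch of how a caller might switch between a direct-answer prompt and a CoT prompt based on a parameter-count threshold. The helper names (`build_prompt`, `choose_prompt`) and the hard 100B cutoff are illustrative assumptions, not part of any real API or of the cited paper.

```python
# Sketch: pick a prompting strategy (direct answer vs. chain-of-thought)
# based on model size. The exact threshold is an assumption motivated by
# the ~100B-parameter emergence point discussed above.

COT_THRESHOLD_PARAMS = 100e9  # ~100B parameters


def build_prompt(question: str, use_cot: bool) -> str:
    """Build either a direct-answer prompt or a CoT prompt for the question."""
    if use_cot:
        # CoT: ask the model to spell out intermediate reasoning steps.
        return f"Q: {question}\nA: Let's think step by step."
    # Direct answer: no reasoning chain requested.
    return f"Q: {question}\nA:"


def choose_prompt(question: str, model_params: float) -> str:
    """Select the prompting strategy from the model's parameter count."""
    use_cot = model_params >= COT_THRESHOLD_PARAMS
    return build_prompt(question, use_cot)


if __name__ == "__main__":
    q = "A farmer has 15 sheep and buys 8 more. How many sheep does he have?"
    print(choose_prompt(q, model_params=8e9))    # small model -> direct answer
    print(choose_prompt(q, model_params=540e9))  # large model -> CoT prompt
```

In practice the cutoff would be tuned per task and per model family rather than hard-coded; the sketch only makes the "prompting strategy depends on model size" rule concrete.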