Fig. 1 | Chain-of-Thought effect by model size. Two lines: With CoT (blue) and Without CoT (gray). The curves diverge only at ~100B parameters. Dark gray marks the effective threshold.
⚠️ The Critical Threshold
Chain-of-Thought prompting only shows a significant effect from a model size of roughly 100 billion parameters upward. Smaller models produce unreliable or even misleading reasoning steps. This is often described as the "emergence" of reasoning capabilities.

Why does CoT only work with large models?

1. Complex reasoning requires capacity: To generate intermediate steps and then use them for the final answer, the model needs enough parameters to represent multi-step logic. Small models lack the capacity ("memory") for multi-step reasoning.
2. Phase change at scale: Wei et al. (2022) showed that many abilities, reasoning in particular, emerge non-linearly in a "phase change" at certain model sizes. CoT is the paradigmatic example of this emergence.
3. Data quality is secondary: Small models do NOT benefit from CoT, even when the training data contains CoT examples. They simply cannot internalize the pattern; size is the primary variable.
4. Local vs. global reasoning: Small models can predict local patterns (the next token), but they cannot plan globally across Step 1 → Step 2 → Step 3 → Solution. That requires hierarchical reasoning.
5. Fine-tuning doesn't help: You can fine-tune smaller models on CoT data, but they don't get significantly better. They merely become better at outputting CoT-shaped strings; the actual reasoning quality stays low.
6. Implication for practitioners: For small models (< 50B parameters), skip CoT and focus on direct few-shot prompting, template-based prompts, or retrieval instead. CoT is a waste of context tokens there (see the prompt sketch after this list).
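
To make the contrast in point 6 concrete, here is a minimal sketch of the two prompt styles for the same question. The model call (`call_model`) is a hypothetical placeholder for whatever client you use; only the prompt construction is the point.

```python
# Contrast: direct few-shot prompt vs. Chain-of-Thought prompt.
# `call_model` is a hypothetical stand-in; plug in your own model/API client.

QUESTION = "A train travels 60 km in 1.5 hours. What is its average speed?"

# Direct few-shot prompt: the recommended style for smaller (< 50B) models.
direct_prompt = (
    "Q: A car travels 100 km in 2 hours. What is its average speed?\n"
    "A: 50 km/h\n\n"
    f"Q: {QUESTION}\n"
    "A:"
)

# Chain-of-Thought prompt: the exemplar spells out intermediate steps.
# Worth the extra tokens only for ~100B+ models.
cot_prompt = (
    "Q: A car travels 100 km in 2 hours. What is its average speed?\n"
    "A: Speed is distance divided by time. 100 km / 2 h = 50 km/h. "
    "The answer is 50 km/h.\n\n"
    f"Q: {QUESTION}\n"
    "A: Let's think step by step."
)

def call_model(prompt: str) -> str:
    """Placeholder: replace with your own model/API client."""
    raise NotImplementedError
```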

Empirical Data: Wei et al. (2022)

From the original paper "Emergent Abilities of Large Language Models" (Wei et al., 2022). The table shows accuracy with and without CoT across model sizes.

| Model & Size | MATH (without CoT) | MATH (with CoT) | CoT Gain | Effective? |
|---|---|---|---|---|
| PaLM 8B | 2% | 2% | +0% | ❌ No |
| PaLM 62B | 4% | 4% | +0% | ❌ No |
| PaLM 540B | 8% | 56% | +48% | ✅ Yes |
| GPT-3 175B | 17% | 71% | +54% | ✅ Yes |
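
Before relying on numbers like these for your own model and task, it is worth measuring the CoT gain yourself. The following is a minimal sketch of such a comparison; `generate`, `extract_answer`, and the "Let's think step by step" suffix are illustrative assumptions, not part of the paper.

```python
# Sketch: measure CoT gain (accuracy with CoT minus accuracy without)
# on your own evaluation set. `generate` is a hypothetical hook for your model client.

from typing import Callable, Sequence, Tuple

def extract_answer(text: str) -> str:
    """Very naive answer extraction: take the last whitespace-separated token."""
    return text.strip().split()[-1] if text.strip() else ""

def accuracy(
    generate: Callable[[str], str],
    items: Sequence[Tuple[str, str]],  # (question, gold answer) pairs
    with_cot: bool,
) -> float:
    correct = 0
    for question, gold in items:
        suffix = "\nLet's think step by step." if with_cot else "\nAnswer:"
        prediction = extract_answer(generate(question + suffix))
        correct += int(prediction == gold)
    return correct / len(items)

# cot_gain = accuracy(generate, eval_set, with_cot=True) \
#          - accuracy(generate, eval_set, with_cot=False)
```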

Practical Implications

For 7B/13B models (e.g., Llama): CoT is not recommended. Use direct prompts, short few-shot examples, or specialized fine-tuning instead.
For 70B models (e.g., Llama 2/3 70B): CoT can help, but it is not guaranteed to. Experiment and measure (see the measurement sketch above). Often structured prompting (XML tags, templates) helps more.
For 100B+ models (GPT-4, Claude 3): CoT is very effective. "Let's think step by step" is a safe, reliable technique for complex tasks.
Test-Time Compute: Instead of CoT for small models, use parallel samples (Best-of-N) or other test-time techniques; these are model-size agnostic (see the sketch below).
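
As a minimal sketch of the Best-of-N idea: sample several answers at nonzero temperature and keep the most frequent one (majority voting, in the spirit of self-consistency). The `sample` function is a hypothetical hook for whatever sampling API you use.

```python
# Best-of-N with majority voting: a model-size-agnostic alternative to CoT.
# `sample` is a hypothetical wrapper around a model call with temperature > 0.

from collections import Counter
from typing import Callable

def best_of_n(sample: Callable[[str], str], prompt: str, n: int = 8) -> str:
    """Draw n samples and return the most common answer (ties broken arbitrarily)."""
    answers = [sample(prompt).strip() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Usage (assuming `sample` wraps your model client):
# answer = best_of_n(sample, "A train travels 60 km in 1.5 hours. Average speed?", n=16)
```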