Fig. 1 | Chain-of-Thought effect by model size. Two lines: With CoT (blue) and Without CoT (gray). The curves diverge only at ~100B parameters. Dark gray marks the effective threshold.
⚠️ The Critical Threshold
Chain-of-Thought prompting only shows a significant effect from a model size of roughly 100 billion parameters upward. Smaller models produce unreliable or even misleading reasoning steps. This is often described as the "emergence" of reasoning capabilities.

Why does CoT only work with large models?

1. Complex reasoning requires capacity: To generate intermediate steps and then use them for the final answer, the model needs enough parameters to represent multi-step logic. Small models lack the capacity ("memory") for multi-step reasoning.
2. Phase change at scale: Wei et al. (2022) showed that many abilities, reasoning in particular, emerge non-linearly in a "phase change" at certain model sizes. CoT is the paradigmatic example of this emergence.
3. Data quality is secondary: Small models do NOT benefit from CoT, even when the training data contains CoT examples. They simply cannot internalize the pattern; size is the primary variable.
4. Local vs. global reasoning: Small models can predict local patterns (the next token), but they cannot plan globally across Step 1 → Step 2 → Step 3 → Solution. That requires hierarchical reasoning.
5. Fine-tuning doesn't help: You can fine-tune smaller models on CoT data, but they don't get significantly better. They merely become better at outputting CoT-shaped strings; the actual reasoning quality stays low.
6. Implication for practitioners: For small models (< 50B parameters), skip CoT and focus on direct few-shot prompting, template-based prompts, or retrieval instead. CoT is a waste of context tokens there (see the prompt sketch after this list).
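
To make the contrast in point 6 concrete, here is a minimal sketch of the two prompt styles for the same question. The model call (`call_model`) is a hypothetical placeholder for whatever client you use; only the prompt construction is the point.

```python
# Contrast: direct few-shot prompt vs. Chain-of-Thought prompt.
# `call_model` is a hypothetical stand-in; plug in your own model/API client.

QUESTION = "A train travels 60 km in 1.5 hours. What is its average speed?"

# Direct few-shot prompt: the recommended style for smaller (< 50B) models.
direct_prompt = (
    "Q: A car travels 100 km in 2 hours. What is its average speed?\n"
    "A: 50 km/h\n\n"
    f"Q: {QUESTION}\n"
    "A:"
)

# Chain-of-Thought prompt: the exemplar spells out intermediate steps.
# Worth the extra tokens only for ~100B+ models.
cot_prompt = (
    "Q: A car travels 100 km in 2 hours. What is its average speed?\n"
    "A: Speed is distance divided by time. 100 km / 2 h = 50 km/h. "
    "The answer is 50 km/h.\n\n"
    f"Q: {QUESTION}\n"
    "A: Let's think step by step."
)

def call_model(prompt: str) -> str:
    """Placeholder: replace with your own model/API client."""
    raise NotImplementedError
```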

Empirical Data: Wei et al. (2022)

From the original paper "Emergent Abilities of Large Language Models" (Wei et al., 2022). The table shows accuracy with and without CoT across model sizes.

| Model & Size | MATH (without CoT) | MATH (with CoT) | CoT Gain | Effective? |
|---|---|---|---|---|
| PaLM 8B | 2% | 2% | +0% | ❌ No |
| PaLM 62B | 4% | 4% | +0% | ❌ No |
| PaLM 540B | 8% | 56% | +48% | ✅ Yes |
| GPT-3 175B | 17% | 71% | +54% | ✅ Yes |
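
Before relying on numbers like these for your own model and task, it is worth measuring the CoT gain yourself. The following is a minimal sketch of such a comparison; `generate`, `extract_answer`, and the "Let's think step by step" suffix are illustrative assumptions, not part of the paper.

```python
# Sketch: measure CoT gain (accuracy with CoT minus accuracy without)
# on your own evaluation set. `generate` is a hypothetical hook for your model client.

from typing import Callable, Sequence, Tuple

def extract_answer(text: str) -> str:
    """Very naive answer extraction: take the last whitespace-separated token."""
    return text.strip().split()[-1] if text.strip() else ""

def accuracy(
    generate: Callable[[str], str],
    items: Sequence[Tuple[str, str]],  # (question, gold answer) pairs
    with_cot: bool,
) -> float:
    correct = 0
    for question, gold in items:
        suffix = "\nLet's think step by step." if with_cot else "\nAnswer:"
        prediction = extract_answer(generate(question + suffix))
        correct += int(prediction == gold)
    return correct / len(items)

# cot_gain = accuracy(generate, eval_set, with_cot=True) \
#          - accuracy(generate, eval_set, with_cot=False)
```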

Practical Implications

For 7B/13B models (e.g., Llama): CoT is not recommended. Use direct prompts, short few-shot examples, or specialized fine-tuning instead.
For 70B models (e.g., Llama 2/3 70B): CoT can help, but it is not guaranteed to. Experiment and measure (see the measurement sketch above). Often structured prompting (XML tags, templates) helps more.
For 100B+ models (GPT-4, Claude 3): CoT is very effective. "Let's think step by step" is a safe, reliable technique for complex tasks.
Test-Time Compute: Instead of CoT for small models, use parallel samples (Best-of-N) or other test-time techniques; these are model-size agnostic (see the sketch below).
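
As a minimal sketch of the Best-of-N idea: sample several answers at nonzero temperature and keep the most frequent one (majority voting, in the spirit of self-consistency). The `sample` function is a hypothetical hook for whatever sampling API you use.

```python
# Best-of-N with majority voting: a model-size-agnostic alternative to CoT.
# `sample` is a hypothetical wrapper around a model call with temperature > 0.

from collections import Counter
from typing import Callable

def best_of_n(sample: Callable[[str], str], prompt: str, n: int = 8) -> str:
    """Draw n samples and return the most common answer (ties broken arbitrarily)."""
    answers = [sample(prompt).strip() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Usage (assuming `sample` wraps your model client):
# answer = best_of_n(sample, "A train travels 60 km in 1.5 hours. Average speed?", n=16)
```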