The "Let's think step by step" Phenomenon

A simple technique can dramatically improve the performance of language models: asking the model to express its thoughts step by step. This is called Chain-of-Thought (CoT).

Without CoT, large models can make errors by answering too quickly. With CoT, accuracy improves, especially on reasoning tasks like mathematics, logic, and multi-step problems.

Important: The effect is much stronger with large models (100B+ parameters). For small models (under 10B), CoT helps less or not at all.

[Interactive demo: given the same multi-step problem, the model without Chain-of-Thought answers directly with 15 (wrong); with Chain-of-Thought it works step by step and arrives at 22 (correct).]

What is Chain-of-Thought?

Chain-of-Thought is a prompting technique where you ask the model to express its thoughts before giving an answer. This has several effects:

1. Slow Thinking: The model "thinks" through the problem instead of guessing.

2. Error Checking: When the model has to write out steps, it can notice and correct errors in the logic.

3. Explicit Deduction: The intermediate steps show the logic, not just the final result.

Zero-Shot vs Few-Shot CoT

Zero-Shot CoT: "Let's think step by step" — a magic prompt without examples

Problem: 3 × (4 + 2) - 5 = ?
Prompt: "Let's think step by step."
Output:
  Step 1: 4 + 2 = 6
  Step 2: 3 × 6 = 18
  Step 3: 18 - 5 = 13
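In code, zero-shot CoT is just string construction. A minimal sketch, assuming a hypothetical `call_model` function for the actual LLM request (not shown here):

```python
# Zero-shot CoT: append the trigger phrase so the model reasons
# before answering. No worked examples are needed.

COT_TRIGGER = "Let's think step by step."

def zero_shot_cot_prompt(question: str) -> str:
    """Build a zero-shot CoT prompt for a single question."""
    return f"Q: {question}\nA: {COT_TRIGGER}"

prompt = zero_shot_cot_prompt("3 × (4 + 2) - 5 = ?")
# The prompt would then be sent to a model, e.g.:
# answer = call_model(prompt)
print(prompt)
```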

Few-Shot CoT: Show examples of how to think

Problem: 3 × (4 + 2) - 5 = ?
Prompt:
  Q: 2 × (3 + 5) + 1 = ?
  A: Step 1: 3 + 5 = 8
     Step 2: 2 × 8 = 16
     Step 3: 16 + 1 = 17
  Q: 3 × (4 + 2) - 5 = ?
  A: (Model follows the example format)
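The few-shot variant prepends worked examples before the new question. A minimal sketch (the example text and function names are illustrative):

```python
# Few-shot CoT: worked examples demonstrate the step-by-step
# format the model should imitate for the new question.

WORKED_EXAMPLE = (
    "Q: 2 × (3 + 5) + 1 = ?\n"
    "A: Step 1: 3 + 5 = 8\n"
    "Step 2: 2 × 8 = 16\n"
    "Step 3: 16 + 1 = 17"
)

def few_shot_cot_prompt(question: str, examples=(WORKED_EXAMPLE,)) -> str:
    """Prepend worked examples, then pose the new question."""
    shots = "\n\n".join(examples)
    return f"{shots}\n\nQ: {question}\nA:"

print(few_shot_cot_prompt("3 × (4 + 2) - 5 = ?"))
```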

Model Size vs CoT Effect

The chart above shows accuracy on mathematical benchmarks (like GSM8K) depending on model size.


Why does CoT only work with large models?

Hypothesis: Large models have learned that the intermediate steps they generate are valuable for reasoning. Small models haven't developed this capability.

In other words: CoT works because the model itself uses the steps to think better, not because the user sees them.

When should you use CoT?

Situation              | Use CoT? | Why
Mathematical problems  | ✓ Yes    | Multi-step reasoning is essential
Logic & deduction      | ✓ Yes    | Explicit argumentation helps
General QA             | ~ Maybe  | Only helps when complex thinking is needed
Summarization          | ✗ No     | No complex intermediate steps needed
Creative tasks         | ✗ No     | Can constrain creativity
Small models (<10B)    | ✗ No     | Model cannot reason meaningfully step by step
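The table can be encoded as a tiny routing helper. This is an illustrative sketch; the 10B threshold and the task labels are assumptions, not fixed rules:

```python
# Decide whether to add the CoT trigger, mirroring the table above.
# Task labels and the 10B cutoff are illustrative assumptions.

COT_TASKS = {"math", "logic", "multi_step_qa"}

def should_use_cot(task: str, model_params_billions: float) -> bool:
    """Return True when CoT prompting is likely to help."""
    if model_params_billions < 10:  # small models rarely benefit
        return False
    return task in COT_TASKS
```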

Related Concepts

Self-Consistency: Generate multiple CoT chains and take a majority vote over their final answers. This can add roughly 3-5 percentage points of accuracy on top of plain CoT.
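A minimal sketch of self-consistency, where `sampler` stands in for a real model call at non-zero temperature and answers are assumed to end with an "Answer:" marker:

```python
# Self-consistency: sample several CoT chains, extract each final
# answer, and return the most common one (majority vote).

from collections import Counter

def extract_answer(chain: str) -> str:
    """Take the text after the last 'Answer:' marker in a chain."""
    return chain.rsplit("Answer:", 1)[-1].strip()

def self_consistency(question: str, sampler, n: int = 10) -> str:
    """Sample n chains for one question and vote on the answers."""
    votes = Counter(extract_answer(sampler(question)) for _ in range(n))
    return votes.most_common(1)[0][0]
```

In practice `sampler` would call the model with temperature > 0 so the chains differ from one another.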

Least-to-Most Prompting: Break down complex problems into simpler subproblems. Solve them from simple to complex.
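A sketch of the two-stage least-to-most loop, assuming a user-supplied `call_model` function and a model that returns one subproblem per line:

```python
# Least-to-most: ask the model to decompose the problem, then solve
# the subproblems in order, feeding earlier answers into later prompts.

def least_to_most(question: str, call_model) -> str:
    """Decompose, then solve subproblems from simple to complex."""
    decomposition = call_model(
        f"Break this problem into simpler subproblems, one per line:\n{question}"
    )
    subproblems = [s.strip() for s in decomposition.splitlines() if s.strip()]
    context = ""
    answer = ""
    for sub in subproblems:
        answer = call_model(f"{context}Solve: {sub}")
        context += f"Solve: {sub}\nAnswer: {answer}\n"
    return answer  # answer to the final (full) subproblem
```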

Comparison to Hidden Reasoning (o1/o3): CoT is explicit (the user sees the steps), while o1-style models reason implicitly (the model thinks internally and hides the chain). Hidden reasoning often performs better, but is also more expensive.