How explicit reasoning steps improve the accuracy of language models
Chain-of-Thought (CoT) is a prompting technique where the model articulates its thinking process step by step. This significantly improves accuracy on complex tasks — especially in mathematics and logical reasoning.
CoT is the foundation of modern reasoning approaches; models such as OpenAI's o1 and DeepSeek-R1 build on it.
CoT turns an autoregressive LLM into a "thinker". Wei et al. (2022) showed that on the GSM8K math benchmark, accuracy jumps from roughly 18% to 57% simply by adding worked reasoning steps to the prompt. The zero-shot trigger phrase "Let's think step by step" was introduced separately by Kojima et al. (2022).
A simple technique can dramatically improve the performance of language models: asking the model to express its thoughts step by step. This is called Chain-of-Thought (CoT).
Without CoT, large models can make errors by answering too quickly. With CoT, accuracy improves, especially on reasoning tasks like mathematics, logic, and multi-step problems.
Important: The effect is much stronger with large models (100B+ parameters). For small models (under 10B), CoT helps less or not at all.
Chain-of-Thought is a prompting technique where you ask the model to express its thoughts before giving an answer. This has several effects:
1. Slow Thinking: The model "thinks" through the problem instead of guessing.
2. Error Checking: When the model has to write out steps, it can notice and correct errors in the logic.
3. Explicit Deduction: The intermediate steps show the logic, not just the final result.
Zero-Shot CoT: append the trigger phrase "Let's think step by step" to the question, with no examples (Kojima et al., 2022).
Few-Shot CoT: include worked examples in the prompt that demonstrate the step-by-step reasoning style (Wei et al., 2022).
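The two prompt styles can be sketched as plain string templates. This is a minimal illustration; the question and the worked example below are invented, not drawn from any benchmark.

```python
# Sketch: constructing Zero-Shot vs. Few-Shot CoT prompts as plain strings.
# Both the question and the worked example are illustrative placeholders.

QUESTION = "A farmer has 15 apples and gives away 6. How many are left?"

# Zero-Shot CoT: append the trigger phrase, no examples.
zero_shot_cot = f"{QUESTION}\nLet's think step by step."

# Few-Shot CoT: prepend a worked example that demonstrates the reasoning style.
FEW_SHOT_EXAMPLE = (
    "Q: Tom has 3 boxes with 4 pens each. How many pens does he have?\n"
    "A: Each box holds 4 pens. 3 boxes times 4 pens is 12 pens. The answer is 12.\n"
)
few_shot_cot = f"{FEW_SHOT_EXAMPLE}\nQ: {QUESTION}\nA:"

print(zero_shot_cot)
print(few_shot_cot)
```

Either string would then be sent to the model as the full prompt; the only difference is whether a demonstration of the reasoning format precedes the question.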
The chart above shows accuracy on mathematical benchmarks (such as GSM8K) as a function of model size.
Key Insights:
Hypothesis: Large models have learned that the intermediate steps they generate are valuable for reasoning. Small models haven't developed this capability.
In other words: CoT works because the model itself uses the steps to think better, not because the user sees them.
| Situation | Use CoT? | Why |
|---|---|---|
| Mathematical problems | ✓ Yes | Multi-step reasoning is essential |
| Logic & Deduction | ✓ Yes | Explicit argumentation helps |
| General QA | ~ Maybe | Only helps when complex thinking is needed |
| Summarization | ✗ No | No complex intermediate steps needed |
| Creative tasks | ✗ No | Can constrain creativity |
| With small models (<10B) | ✗ No | Model cannot reason meaningfully |
Self-Consistency: Sample multiple CoT chains and take a majority vote over their final answers (Wang et al., 2022). Can improve accuracy by an additional 3–5%.
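The voting step above can be sketched in a few lines. The `sample_fn` below is a hypothetical stand-in for a stochastic LLM call that returns the final answer extracted from one CoT chain; the toy "model" is deterministic just so the example runs.

```python
from collections import Counter

def self_consistency(sample_fn, prompt, n=5):
    """Sample n reasoning chains and return the majority-vote answer.

    sample_fn(prompt) stands in for one stochastic LLM call that
    returns the final answer parsed from a single CoT chain.
    """
    answers = [sample_fn(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Toy stand-in for a model: yields five pre-scripted chain results.
fake_samples = iter(["12", "12", "11", "12", "13"])
result = self_consistency(lambda _: next(fake_samples), "Q: 3 * 4 = ?", n=5)
print(result)  # majority answer: "12"
```

The key design choice is voting on the final answers only, not on the chains themselves: different reasoning paths that converge on the same answer reinforce each other.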
Least-to-Most Prompting: Break down complex problems into simpler subproblems. Solve them from simple to complex.
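The decompose-then-solve loop can be sketched as follows. `decompose_fn` and `solve_fn` are hypothetical stand-ins for two LLM calls (one that splits the problem, one that solves a subproblem given earlier answers); the arithmetic demo only exists to make the control flow concrete.

```python
def least_to_most(decompose_fn, solve_fn, problem):
    """Least-to-Most sketch: split a problem into subproblems ordered from
    simple to complex, then solve each one with the earlier answers
    available as context. Both functions stand in for LLM calls."""
    subproblems = decompose_fn(problem)      # ordered simple -> complex
    context = []
    for sub in subproblems:
        answer = solve_fn(sub, context)      # earlier answers feed later steps
        context.append((sub, answer))
    return context[-1][1]                    # answer to the final, full problem

# Toy demo: "sum of squares of 1..3" decomposed into per-term subproblems.
def decompose(_problem):
    return ["1^2", "2^2", "3^2", "sum"]

def solve(sub, ctx):
    if sub == "sum":
        return sum(answer for _, answer in ctx)
    return int(sub.split("^")[0]) ** 2

print(least_to_most(decompose, solve, "sum of squares of 1..3"))  # 14
```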
Comparison to Hidden Reasoning (o1/o3): CoT is explicit (the user sees the steps); o1-style reasoning is implicit (the model thinks internally before answering). The implicit approach often performs better, but at higher cost.