How explicit reasoning steps improve the accuracy of language models
Chain-of-Thought (CoT) is a prompting technique where the model articulates its thinking process step by step. This significantly improves accuracy on complex tasks — especially in mathematics and logical reasoning.
CoT is the foundation of modern reasoning approaches; models such as OpenAI's o1 and DeepSeek-R1 build on it.
CoT turns an autoregressive LLM into a "thinker". Wei et al. (2022) showed that on the GSM8K math benchmark, accuracy jumps from roughly 18% to 57% simply by adding worked reasoning steps to the prompt. The zero-shot trigger phrase "Let's think step by step" was introduced separately by Kojima et al. (2022).
A simple technique can dramatically improve the performance of language models: asking the model to express its thoughts step by step. This is called Chain-of-Thought (CoT).
Without CoT, large models can make errors by answering too quickly. With CoT, accuracy improves, especially on reasoning tasks like mathematics, logic, and multi-step problems.
Important: The effect is much stronger with large models (100B+ parameters). For small models (under 10B), CoT helps less or not at all.
Chain-of-Thought is a prompting technique where you ask the model to express its thoughts before giving an answer. This has several effects:
1. Slow Thinking: The model "thinks" through the problem instead of guessing.
2. Error Checking: When the model has to write out steps, it can notice and correct errors in the logic.
3. Explicit Deduction: The intermediate steps show the logic, not just the final result.
Zero-Shot CoT: append the trigger phrase "Let's think step by step" to the question, with no examples (Kojima et al., 2022).
Few-Shot CoT: include worked examples in the prompt that demonstrate the step-by-step reasoning style (Wei et al., 2022).
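The two prompt styles can be sketched as plain string templates. This is a minimal illustration; the question and the worked example below are invented, not drawn from any benchmark.

```python
# Sketch: constructing Zero-Shot vs. Few-Shot CoT prompts as plain strings.
# Both the question and the worked example are illustrative placeholders.

QUESTION = "A farmer has 15 apples and gives away 6. How many are left?"

# Zero-Shot CoT: append the trigger phrase, no examples.
zero_shot_cot = f"{QUESTION}\nLet's think step by step."

# Few-Shot CoT: prepend a worked example that demonstrates the reasoning style.
FEW_SHOT_EXAMPLE = (
    "Q: Tom has 3 boxes with 4 pens each. How many pens does he have?\n"
    "A: Each box holds 4 pens. 3 boxes times 4 pens is 12 pens. The answer is 12.\n"
)
few_shot_cot = f"{FEW_SHOT_EXAMPLE}\nQ: {QUESTION}\nA:"

print(zero_shot_cot)
print(few_shot_cot)
```

Either string would then be sent to the model as the full prompt; the only difference is whether a demonstration of the reasoning format precedes the question.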
The chart above shows accuracy on mathematical benchmarks (such as GSM8K) as a function of model size.
Key Insights:
Hypothesis: Large models have learned that the intermediate steps they generate are valuable for reasoning. Small models haven't developed this capability.
In other words: CoT works because the model itself uses the steps to think better, not because the user sees them.
| Situation | Use CoT? | Why |
|---|---|---|
| Mathematical problems | ✓ Yes | Multi-step reasoning is essential |
| Logic & Deduction | ✓ Yes | Explicit argumentation helps |
| General QA | ~ Maybe | Only helps when complex thinking is needed |
| Summarization | ✗ No | No complex intermediate steps needed |
| Creative tasks | ✗ No | Can constrain creativity |
| With small models (<10B) | ✗ No | Model cannot reason meaningfully |
Self-Consistency: Sample multiple CoT chains and take a majority vote over their final answers (Wang et al., 2022). Can improve accuracy by an additional 3–5%.
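The voting step above can be sketched in a few lines. The `sample_fn` below is a hypothetical stand-in for a stochastic LLM call that returns the final answer extracted from one CoT chain; the toy "model" is deterministic just so the example runs.

```python
from collections import Counter

def self_consistency(sample_fn, prompt, n=5):
    """Sample n reasoning chains and return the majority-vote answer.

    sample_fn(prompt) stands in for one stochastic LLM call that
    returns the final answer parsed from a single CoT chain.
    """
    answers = [sample_fn(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Toy stand-in for a model: yields five pre-scripted chain results.
fake_samples = iter(["12", "12", "11", "12", "13"])
result = self_consistency(lambda _: next(fake_samples), "Q: 3 * 4 = ?", n=5)
print(result)  # majority answer: "12"
```

The key design choice is voting on the final answers only, not on the chains themselves: different reasoning paths that converge on the same answer reinforce each other.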
Least-to-Most Prompting: Break down complex problems into simpler subproblems. Solve them from simple to complex.
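The decompose-then-solve loop can be sketched as follows. `decompose_fn` and `solve_fn` are hypothetical stand-ins for two LLM calls (one that splits the problem, one that solves a subproblem given earlier answers); the arithmetic demo only exists to make the control flow concrete.

```python
def least_to_most(decompose_fn, solve_fn, problem):
    """Least-to-Most sketch: split a problem into subproblems ordered from
    simple to complex, then solve each one with the earlier answers
    available as context. Both functions stand in for LLM calls."""
    subproblems = decompose_fn(problem)      # ordered simple -> complex
    context = []
    for sub in subproblems:
        answer = solve_fn(sub, context)      # earlier answers feed later steps
        context.append((sub, answer))
    return context[-1][1]                    # answer to the final, full problem

# Toy demo: "sum of squares of 1..3" decomposed into per-term subproblems.
def decompose(_problem):
    return ["1^2", "2^2", "3^2", "sum"]

def solve(sub, ctx):
    if sub == "sum":
        return sum(answer for _, answer in ctx)
    return int(sub.split("^")[0]) ** 2

print(least_to_most(decompose, solve, "sum of squares of 1..3"))  # 14
```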
Comparison to Hidden Reasoning (o1/o3): CoT is explicit (the user sees the steps); o1-style reasoning is implicit (the model thinks internally before answering). The implicit approach often performs better, but at higher cost.