What are Induction Heads?

Induction Heads are specialized attention circuits in Transformers that implement a fundamental capability: Pattern Completion.

They are the technical foundation of In-Context Learning (ICL) - the remarkable ability of LLMs to learn new tasks from examples in a prompt, without being retrained.

Fig. 1 | Pattern Completion through Induction Heads: the sequence [A][B]...[A][ ]. The model recognizes the pattern "after A comes B" and therefore predicts B.

The Pattern: [A][B]...[A] → [B]

When an Induction Head sees a token it has seen before, it "remembers" what came after it - and predicts exactly that.

The Induction Head Pattern
Position i recognizes: "I've seen this token before!"
Lookup: "At position j (earlier), after this token came [B]"
Prediction: "So here [B] should come too"

Mathematically: if Token(i) = Token(j) for an earlier position j, then predict Token(i+1) = Token(j+1).
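To make the rule concrete, here is a tiny Python sketch of the lookup it describes - a toy illustration of the rule itself, not of the attention mechanism (the function name is made up):

def induction_predict(tokens):
    """Toy version of the induction rule: if the last token appeared
    earlier, predict the token that followed it back then."""
    current = tokens[-1]
    # Search earlier positions, most recent match first.
    for j in range(len(tokens) - 2, -1, -1):
        if tokens[j] == current:
            return tokens[j + 1]   # Token(i+1) = Token(j+1)
    return None                    # no earlier occurrence, no prediction

print(induction_predict(["A", "B", "C", "A"]))  # -> "B"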

Pattern Matching

IH recognize when a token is repeated and predict the token that followed it earlier in the sequence.

In-Context Learning

IH enable models to recognize and apply new patterns in prompts - without retraining.

Emergent Phenomenon

ICL is not explicitly trained. IH emerge spontaneously during training as a byproduct.

The Mechanism: Two-Layer Circuit

Induction Heads don't work as a single attention operation. They require a two-layer composition:

Fig. 2 | Two-Layer Induction Head Circuit: Layer 1 copies information from the previous token. Layer 2 uses this to complete the pattern.

Layer 1: Previous-Token Head
Attends to the directly preceding position and copies that token's information into the current position's residual stream.

Layer 2: Induction Head
Uses this copied previous-token information to find an earlier position that was preceded by the current token - and copies forward the token stored there.

Two-Layer Schematic
Layer 1 (Previous-Token Head):
Attn1(Q, K, V) → output focused on position t-1

Layer 2 (Induction Head):
Attn2(Q, K_new, V_new), where K_new comes from the Layer 1 output
→ can perform pattern matching

Result: full Pattern Completion [A][B]...[A] → [B] becomes possible
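As a rough illustration of this composition, the following NumPy sketch hand-wires both steps on one-hot tokens. It is a toy under strong assumptions, not a trained model: the Previous-Token Head is simulated by a simple array shift, and the Induction Head's scores are just dot products between the current token and those shifted copies.

import numpy as np

vocab = ["A", "B", "C", "D"]
seq   = ["A", "B", "C", "A"]              # [A][B]...[A] -> expect [B]
d     = len(vocab)

def one_hot(tok):
    v = np.zeros(d)
    v[vocab.index(tok)] = 1.0
    return v

X = np.stack([one_hot(t) for t in seq])   # (T, d) token identities

# Layer 1 (Previous-Token Head), simulated: position i receives a copy of token i-1.
prev = np.zeros_like(X)
prev[1:] = X[:-1]

# Layer 2 (Induction Head): position i attends to positions j whose predecessor
# matches the current token, i.e. token(j-1) == token(i).
scores = 10.0 * (X @ prev.T)              # sharp match against the shifted copies
mask   = np.tril(np.ones_like(scores))    # causal mask: only attend to j <= i
attn   = np.where(mask > 0, scores, -np.inf)
attn   = np.exp(attn - attn.max(axis=-1, keepdims=True))
attn  /= attn.sum(axis=-1, keepdims=True)

# The value read at position j is the token stored there - exactly the token
# that followed the earlier occurrence of the current token.
out = attn @ X
print(vocab[int(out[-1].argmax())])       # -> "B"

The essential point is the composition: the second step's keys only exist because the first step shifted each token's identity one position forward - which is why a single layer cannot do this on its own.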

Why are 2 Layers Needed?

A single attention layer can attend to earlier tokens, but it cannot chain two lookups together. To recognize the pattern [A][B]...[A] → [B], the model must:

1. Find the earlier position where the current token [A] occurred, and
2. Read off the token [B] that followed it.

Step 2 requires that each position already "knows" which token preceded it - and writing that information into the positions is exactly the job of an earlier layer.

With 2+ Layers

Induction Heads form. In-Context Learning capability emerges. Model can generalize patterns.

With Only 1 Layer

No Induction Heads possible. No significant ICL. The model cannot chain "find the earlier occurrence" with "read off what followed it".

Training: The Phase Change of Induction Heads

Induction Heads don't emerge gradually during training. Instead, there is a dramatic phase change - a clearly observable moment at which the capability suddenly appears.

Fig. 3 | Training loss curve with phase change: at approximately 2.5-5 billion training tokens, a clear "bump" appears - the signature of Induction Head emergence.

The Bump in the Loss Curve

Phase Change Characteristic
Timing: 2.5 - 5 billion training tokens
Signal: Clear "bump" visible in the loss curve
Nature: Not a gradual change, but a discrete transition

Interpretation:
- Model reaches critical complexity
- Induction Heads "click" into place
- ICL capability emerges suddenly
- Loss improves sharply afterwards
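One way to watch for this transition, loosely following the original Anthropic analysis, is to track an in-context learning score: the per-token loss late in the context minus the loss early in the context. The sketch below is an illustrative setup, not the paper's exact methodology; it assumes a Hugging Face causal LM, and the model name, positions, and window size are arbitrary choices.

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

def icl_score(text, model, tokenizer, early=50, late=500, window=10):
    """Rough ICL score: average next-token loss near position `late` minus the
    average loss near position `early`. More negative means the model benefits
    more from the context it has already seen. `text` must span > `late` tokens."""
    ids = tokenizer(text, return_tensors="pt").input_ids[:, : late + 1]
    with torch.no_grad():
        logits = model(ids).logits
    # Per-token cross-entropy: the logit at position t predicts token t+1.
    losses = F.cross_entropy(logits[0, :-1], ids[0, 1:], reduction="none")
    return (losses[late - window:late].mean() - losses[early - window:early].mean()).item()

# Illustrative usage (model choice is an assumption):
# tok = AutoTokenizer.from_pretrained("gpt2")
# lm  = AutoModelForCausalLM.from_pretrained("gpt2")
# print(icl_score(long_text, lm, tok))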

What Happens at the Phase Change?

Timing is Early!

Note: 2.5-5B tokens is relatively early in training. Large models train on trillions of tokens. This means: ICL is a fundamental capability that develops quickly.

In-Context Learning: How Induction Heads Enable It

In-Context Learning is the ability of LLMs to learn new tasks from a few examples, without the model being retrained. Induction Heads are the circuits behind it.

Example: Few-Shot Pattern Learning

User: Translate English → German
Example 1: "hello" → "hallo"
Example 2: "goodbye" → "auf wiedersehen"
New Input: "thank you" → ?

IH recognizes: [English][German]...[English] → [German]
Output: "danke"

The model was never explicitly trained to translate "thank you". But the Induction Heads recognize the pattern in the examples and generalize correctly.

Mechanism: Pattern Matching in Prompts

Fig. 4 | In-Context Learning in action: Prompt with examples, then new input. Induction Heads recognize the [Example] → [Output] pattern and generalize to the new input.

ICL is Emergent!

Emergent Property
In-Context Learning is not explicitly trained.

Training process:
1. Model trained on billions of tokens (standard LM loss)
2. Model sees diverse text examples, repetitions, patterns
3. As a byproduct: Induction Heads emerge
4. Byproduct enables: Pattern Completion
5. Pattern Completion enables: ICL

Result: The model can do ICL, even though ICL was never part of the training objective!
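For reference, here is what the pretraining objective actually looks like - a minimal sketch of the standard next-token cross-entropy loss (function name and toy shapes are illustrative). Nothing in it mentions in-context learning:

import torch
import torch.nn.functional as F

def lm_loss(logits, targets):
    """Standard language-model objective: each position's logits are scored
    against the *next* token. There is no ICL-specific term anywhere."""
    # logits: (batch, seq, vocab), targets: (batch, seq) token ids
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        targets[:, 1:].reshape(-1),
    )

# Toy shapes, purely for illustration:
logits  = torch.randn(2, 16, 100)
targets = torch.randint(0, 100, (2, 16))
print(lm_loss(logits, targets))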

Limitations of Induction Heads

The Critical Limitation: At Least 2 Layers Needed

Single-Layer Models

Cannot form Induction Heads. A single layer is not sufficient for the Two-Layer Circuit. Consequence: No significant ICL possible.

Design Implication

Minimum network depth is required for ICL. Shallow models have a fundamental limitation.

Architecture         | Layers       | IH Possible? | ICL Quality | Use Case
Shallow Transformer  | 1 layer      | No           | None        | NLP Toys
Standard Transformer | 2-4 layers   | Yes (weak)   | Weak        | Small Models
Modern LLM           | 20-80 layers | Yes (strong) | Strong      | Production

Other Limitations (under active research, not yet well documented)

Practical Examples for Induction Heads

Example 1: Code Completion

# Pattern in prompt:
def add(a, b):
    return a + b

def subtract(a, b):
    return a - b

def multiply(a, b):
    # IH recognizes: Function → Implementation pattern
    # Output: return a * b

Example 2: Language Translation

English: The weather is nice.
German: Das Wetter ist schoen.
English: I love programming.
German:
# Pattern: English → German
# IH completes: Ich liebe Programmieren.

Example 3: Format Understanding

JSON Format Examples:
{"name": "Alice", "age": 30}
{"name": "Bob", "age": 25}

New Input: {"name": "Charlie", "age":
# Pattern: Name → Age Structure
# IH predicts: 35} (e.g.)

In all these cases, the model was never explicitly taught to complete code or to write German. It simply recognizes the pattern [X][Y]...[X] → [Y] and generalizes.

Key Insights

1. Two-Layer Circuit

Induction Heads are not a single attention operation, but a composition of two layers: Previous-Token + Pattern Matching.

2. Emergent Phenomenon

ICL is not trained. It's a byproduct of LM Pretraining. Induction Heads emerge spontaneously at 2.5-5B tokens.

3. Phase Change

The emergence is not gradual - there is a clear phase change with a recognizable bump in the loss curve.

4. Depth is Essential

Single-layer models cannot form IH. At least 2 layers are required for ICL - a fundamental constraint.

5. Pattern Completion

The core function: [A][B]...[A] → [B] - a simple but powerful mechanism behind many ICL tasks.

6. Interpretability

Induction Heads are an example of mechanistic interpretability - we can literally see and understand the circuits in the model.
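A common diagnostic in this line of interpretability work: run the model on a random token sequence followed by an exact copy of itself, extract a head's attention matrix, and measure how much attention goes to the "token after the previous occurrence" offset. The sketch below only scores such an attention matrix; how you obtain it depends on your tooling, and the function name and usage are illustrative assumptions.

import numpy as np

def induction_score(attn, period):
    """Score how 'induction-like' an attention pattern is on a sequence that
    repeats with the given period (random prefix followed by its exact copy).
    A perfect Induction Head attends from position i back to i - period + 1:
    the token *after* the previous occurrence of the current token."""
    T = attn.shape[0]
    hits = [attn[i, i - period + 1] for i in range(period, T)]  # second half only
    return float(np.mean(hits))

# Usage idea (hypothetical): attn = one head's attention matrix on a
# length-2*period repeated sequence; scores near 1.0 indicate an Induction Head.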