What are Induction Heads?

Induction Heads are specialized attention circuits in Transformers that implement a fundamental capability: Pattern Completion.

They are the technical foundation of In-Context Learning (ICL) - the remarkable ability of LLMs to learn new tasks from examples in a prompt, without being retrained.

Fig. 1 | Pattern Completion through Induction Heads: the sequence [A][B]...[A][ ]. The model recognizes the pattern "after A comes B" and therefore predicts B.

The Pattern: [A][B]...[A] → [B]

When an Induction Head sees a token it has seen before, it "remembers" what came after it - and predicts exactly that.

The Induction Head Pattern
Position i recognizes: "I've seen this token before!"
Lookup: "At position j (earlier), after this token came [B]"
Prediction: "So here [B] should come too"

Mathematically: if Token(i) = Token(j) for an earlier position j, then predict Token(i+1) = Token(j+1).
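To make the rule concrete, here is a tiny Python sketch of the lookup it describes - a toy illustration of the rule itself, not of the attention mechanism (the function name is made up):

def induction_predict(tokens):
    """Toy version of the induction rule: if the last token appeared
    earlier, predict the token that followed it back then."""
    current = tokens[-1]
    # Search earlier positions, most recent match first.
    for j in range(len(tokens) - 2, -1, -1):
        if tokens[j] == current:
            return tokens[j + 1]   # Token(i+1) = Token(j+1)
    return None                    # no earlier occurrence, no prediction

print(induction_predict(["A", "B", "C", "A"]))  # -> "B"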

Pattern Matching

IH recognize when a token is repeated and predict the token that followed it earlier in the sequence.

In-Context Learning

IH enable models to recognize and apply new patterns in prompts - without retraining.

Emergent Phenomenon

ICL is not explicitly trained. IH emerge spontaneously during training as a byproduct.

The Mechanism: Two-Layer Circuit

Induction Heads don't work as a single attention operation. They require a two-layer composition:

Fig. 2 | Two-Layer Induction Head Circuit: Layer 1 copies information from the previous token. Layer 2 uses this to complete the pattern.

Layer 1: Previous-Token Head
Attends to the directly preceding position and copies that token's information into the current position's residual stream.

Layer 2: Induction Head
Uses this copied previous-token information to find an earlier position that was preceded by the current token - and copies forward the token stored there.

Two-Layer Schematic
Layer 1 (Previous-Token Head):
Attn1(Q, K, V) → output focused on position t-1

Layer 2 (Induction Head):
Attn2(Q, K_new, V_new), where K_new comes from the Layer 1 output
→ can perform pattern matching

Result: full Pattern Completion [A][B]...[A] → [B] becomes possible
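As a rough illustration of this composition, the following NumPy sketch hand-wires both steps on one-hot tokens. It is a toy under strong assumptions, not a trained model: the Previous-Token Head is simulated by a simple array shift, and the Induction Head's scores are just dot products between the current token and those shifted copies.

import numpy as np

vocab = ["A", "B", "C", "D"]
seq   = ["A", "B", "C", "A"]              # [A][B]...[A] -> expect [B]
d     = len(vocab)

def one_hot(tok):
    v = np.zeros(d)
    v[vocab.index(tok)] = 1.0
    return v

X = np.stack([one_hot(t) for t in seq])   # (T, d) token identities

# Layer 1 (Previous-Token Head), simulated: position i receives a copy of token i-1.
prev = np.zeros_like(X)
prev[1:] = X[:-1]

# Layer 2 (Induction Head): position i attends to positions j whose predecessor
# matches the current token, i.e. token(j-1) == token(i).
scores = 10.0 * (X @ prev.T)              # sharp match against the shifted copies
mask   = np.tril(np.ones_like(scores))    # causal mask: only attend to j <= i
attn   = np.where(mask > 0, scores, -np.inf)
attn   = np.exp(attn - attn.max(axis=-1, keepdims=True))
attn  /= attn.sum(axis=-1, keepdims=True)

# The value read at position j is the token stored there - exactly the token
# that followed the earlier occurrence of the current token.
out = attn @ X
print(vocab[int(out[-1].argmax())])       # -> "B"

The essential point is the composition: the second step's keys only exist because the first step shifted each token's identity one position forward - which is why a single layer cannot do this on its own.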

Why are 2 Layers Needed?

A single attention layer can attend to earlier tokens, but it cannot chain two lookups together. To recognize the pattern [A][B]...[A] → [B], the model must:

1. Find the earlier position where the current token [A] occurred, and
2. Read off the token [B] that followed it.

Step 2 requires that each position already "knows" which token preceded it - and writing that information into the positions is exactly the job of an earlier layer.

With 2+ Layers

Induction Heads form. In-Context Learning capability emerges. Model can generalize patterns.

With Only 1 Layer

No Induction Heads possible. No significant ICL. The model cannot chain "find the earlier occurrence" with "read off what followed it".

Training: The Phase Change of Induction Heads

Induction Heads don't emerge gradually during training. Instead, there is a dramatic phase change - a clearly observable moment at which the capability suddenly appears.

Fig. 3 | Training loss curve with phase change: at approximately 2.5-5 billion training tokens, a clear "bump" appears - the signature of Induction Head emergence.

The Bump in the Loss Curve

Phase Change Characteristic
Timing: 2.5 - 5 billion training tokens
Signal: Clear "bump" visible in the loss curve
Nature: Not a gradual change, but a discrete transition

Interpretation:
- Model reaches critical complexity
- Induction Heads "click" into place
- ICL capability emerges suddenly
- Loss improves sharply afterwards
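One way to watch for this transition, loosely following the original Anthropic analysis, is to track an in-context learning score: the per-token loss late in the context minus the loss early in the context. The sketch below is an illustrative setup, not the paper's exact methodology; it assumes a Hugging Face causal LM, and the model name, positions, and window size are arbitrary choices.

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

def icl_score(text, model, tokenizer, early=50, late=500, window=10):
    """Rough ICL score: average next-token loss near position `late` minus the
    average loss near position `early`. More negative means the model benefits
    more from the context it has already seen. `text` must span > `late` tokens."""
    ids = tokenizer(text, return_tensors="pt").input_ids[:, : late + 1]
    with torch.no_grad():
        logits = model(ids).logits
    # Per-token cross-entropy: the logit at position t predicts token t+1.
    losses = F.cross_entropy(logits[0, :-1], ids[0, 1:], reduction="none")
    return (losses[late - window:late].mean() - losses[early - window:early].mean()).item()

# Illustrative usage (model choice is an assumption):
# tok = AutoTokenizer.from_pretrained("gpt2")
# lm  = AutoModelForCausalLM.from_pretrained("gpt2")
# print(icl_score(long_text, lm, tok))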

What Happens at the Phase Change?

Timing is Early!

Note: 2.5-5B tokens is relatively early in training. Large models train on trillions of tokens. This means: ICL is a fundamental capability that develops quickly.

In-Context Learning: How Induction Heads Enable It

In-Context Learning is the ability of LLMs to learn new tasks from a few examples, without the model being retrained. Induction Heads are the circuits behind it.

Example: Few-Shot Pattern Learning

User: Translate English → German
Example 1: "hello" → "hallo"
Example 2: "goodbye" → "auf wiedersehen"
New Input: "thank you" → ?

IH recognizes: [English][German]...[English] → [German]
Output: "danke"

The model was never explicitly trained to translate "thank you". But the Induction Heads recognize the pattern in the examples and generalize correctly.

Mechanism: Pattern Matching in Prompts

Fig. 4 | In-Context Learning in action: Prompt with examples, then new input. Induction Heads recognize the [Example] → [Output] pattern and generalize to the new input.

ICL is Emergent!

Emergent Property
In-Context Learning is not explicitly trained.

Training process:
1. Model trained on billions of tokens (standard LM loss)
2. Model sees diverse text examples, repetitions, patterns
3. As a byproduct: Induction Heads emerge
4. Byproduct enables: Pattern Completion
5. Pattern Completion enables: ICL

Result: The model can do ICL, even though ICL was never part of the training objective!
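For reference, here is what the pretraining objective actually looks like - a minimal sketch of the standard next-token cross-entropy loss (function name and toy shapes are illustrative). Nothing in it mentions in-context learning:

import torch
import torch.nn.functional as F

def lm_loss(logits, targets):
    """Standard language-model objective: each position's logits are scored
    against the *next* token. There is no ICL-specific term anywhere."""
    # logits: (batch, seq, vocab), targets: (batch, seq) token ids
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        targets[:, 1:].reshape(-1),
    )

# Toy shapes, purely for illustration:
logits  = torch.randn(2, 16, 100)
targets = torch.randint(0, 100, (2, 16))
print(lm_loss(logits, targets))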

Limitations of Induction Heads

The Critical Limitation: At Least 2 Layers Needed

Single-Layer Models

Cannot form Induction Heads. A single layer is not sufficient for the Two-Layer Circuit. Consequence: No significant ICL possible.

Design Implication

Minimum network depth is required for ICL. Shallow models have a fundamental limitation.

Architecture         | Layers       | IH Possible? | ICL Quality | Use Case
Shallow Transformer  | 1 layer      | No           | None        | NLP Toys
Standard Transformer | 2-4 layers   | Yes (weak)   | Weak        | Small Models
Modern LLM           | 20-80 layers | Yes (strong) | Strong      | Production

Other Limitations (under active research, not yet well documented)

Practical Examples for Induction Heads

Example 1: Code Completion

# Pattern in prompt:
def add(a, b):
    return a + b

def subtract(a, b):
    return a - b

def multiply(a, b):
    # IH recognizes: Function → Implementation pattern
    # Output: return a * b

Example 2: Language Translation

English: The weather is nice.
German: Das Wetter ist schoen.
English: I love programming.
German:
# Pattern: English → German
# IH completes: Ich liebe Programmieren.

Example 3: Format Understanding

JSON Format Examples:
{"name": "Alice", "age": 30}
{"name": "Bob", "age": 25}

New Input: {"name": "Charlie", "age":
# Pattern: Name → Age Structure
# IH predicts: 35} (e.g.)

In all these cases, the model was never explicitly taught to complete code or to write German. It simply recognizes the pattern [X][Y]...[X] → [Y] and generalizes.

Key Insights

1. Two-Layer Circuit

Induction Heads are not a single attention operation, but a composition of two layers: Previous-Token + Pattern Matching.

2. Emergent Phenomenon

ICL is not trained. It's a byproduct of LM Pretraining. Induction Heads emerge spontaneously at 2.5-5B tokens.

3. Phase Change

The emergence is not gradual - there is a clear phase change with a recognizable bump in the loss curve.

4. Depth is Essential

Single-layer models cannot form IH. At least 2 layers are required for ICL - a fundamental constraint.

5. Pattern Completion

The core function: [A][B]...[A] → [B] - a simple but powerful mechanism behind many ICL tasks.

6. Interpretability

Induction Heads are an example of mechanistic interpretability - we can literally see and understand the circuits in the model.
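A common diagnostic in this line of interpretability work: run the model on a random token sequence followed by an exact copy of itself, extract a head's attention matrix, and measure how much attention goes to the "token after the previous occurrence" offset. The sketch below only scores such an attention matrix; how you obtain it depends on your tooling, and the function name and usage are illustrative assumptions.

import numpy as np

def induction_score(attn, period):
    """Score how 'induction-like' an attention pattern is on a sequence that
    repeats with the given period (random prefix followed by its exact copy).
    A perfect Induction Head attends from position i back to i - period + 1:
    the token *after* the previous occurrence of the current token."""
    T = attn.shape[0]
    hits = [attn[i, i - period + 1] for i in range(period, T)]  # second half only
    return float(np.mean(hits))

# Usage idea (hypothetical): attn = one head's attention matrix on a
# length-2*period repeated sequence; scores near 1.0 indicate an Induction Head.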