The attention circuits that enable LLMs to learn from prompt examples - the technical foundation for In-Context Learning
Induction Heads are specialized attention circuits that enable pattern completion. They are the reason why LLMs can learn from a few examples in the prompt (In-Context Learning).
Technical foundations, for reference.
Induction Heads explain why LLMs can "learn" without changing their weights. They are a key concept for understanding emergent capabilities.
Induction Heads are specialized attention circuits in the Transformer that implement a fundamental capability: pattern completion.
They are the technical foundation for In-Context Learning (ICL) - the remarkable ability of LLMs to learn from examples in a prompt, without the model being retrained.
When an Induction Head sees a token it has seen before, it "remembers" what came after - and predicts exactly that.
IH recognize when a token recurs and predict the token that followed it the last time it appeared.
IH enable models to recognize and apply new patterns in prompts - without retraining.
ICL is not explicitly trained. IH emerge spontaneously during training as a byproduct.
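As a rough illustration (toy Python of our own, not actual model internals), the rule an Induction Head implements can be written as a simple lookup over the context:

```python
def induction_rule(context, current):
    """Toy sketch of the induction rule [A][B] ... [A] -> [B]:
    if `current` occurred earlier in `context`, predict the token that
    followed its most recent earlier occurrence."""
    for i in range(len(context) - 2, -1, -1):  # scan backwards through the context
        if context[i] == current:
            return context[i + 1]
    return None  # token never seen before -> no induction prediction

print(induction_rule(["the", "cat", "sat", "on", "the"], "cat"))  # -> "sat"
```

In a real Transformer this lookup is of course implemented softly via attention weights, not as an explicit loop.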
Induction Heads don't work as a single attention operation. They require a two-layer composition:
A single layer can only process direct neighbors. To recognize the pattern [A][B]...[A] → [B], the model must compose two steps (a toy sketch follows below):
1. Previous-Token Head (earlier layer): each position copies information about the token directly before it, so the position holding [B] now "knows" that it follows [A].
2. Pattern Matching / Induction Head (later layer): at the second [A], the head attends to positions whose previous token was [A] - i.e. the position holding [B] - and copies that token as the prediction.
With at least two layers: Induction Heads form, In-Context Learning emerges, and the model can generalize patterns.
With only one layer: no Induction Heads are possible, there is no significant ICL, and the model can only use directly adjacent tokens.
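A minimal numerical sketch of this two-layer composition, using hard one-hot attention (our own illustration; trained heads learn soft, approximate versions of these matrices):

```python
import numpy as np

def induction_circuit(tokens, vocab_size):
    """Toy two-layer induction circuit with hard (one-hot) attention.
    `tokens` are integer ids; this illustrates the composition, not how a
    trained Transformer actually parameterises these heads."""
    T = len(tokens)
    x = np.eye(vocab_size)[tokens]        # one-hot embeddings, shape (T, V)

    # Layer 1 - previous-token head: position i attends to position i-1,
    # so each position now also "knows" which token preceded it.
    attn1 = np.zeros((T, T))
    for i in range(1, T):
        attn1[i, i - 1] = 1.0
    prev_info = attn1 @ x

    # Layer 2 - induction head: the current token [A] (last position) matches
    # against the stored previous tokens, i.e. it attends to positions that
    # directly follow an earlier occurrence of [A], and copies what it finds.
    query = x[-1]
    scores = prev_info @ query            # 1.0 where the previous token == [A]
    scores[-1] = 0.0                      # ignore the current position itself
    if scores.sum() == 0:
        return None                       # [A] never occurred before
    attn2 = scores / scores.sum()
    out = attn2 @ x                       # copy the token(s) that followed [A]
    return int(out.argmax())              # predicted [B]

# [A][B] ... [A] -> [B]
print(induction_circuit([5, 7, 2, 9, 5], vocab_size=10))  # -> 7
```

The key point: layer 2 can only find "the position right after an earlier [A]" because layer 1 already wrote the previous token into each position - which is exactly why a single layer is not enough.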
Induction Heads don't emerge gradually during training. Instead, there is a dramatic phase change - a clearly observable moment at which the capability suddenly appears.
Note: 2.5-5B tokens is relatively early in training - large models train on trillions of tokens. This means that ICL is a fundamental capability that develops quickly.
In-Context Learning is the ability of LLMs to learn new tasks from a few examples, without the model being retrained. Induction Heads are the circuits behind it.
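As a concrete illustration, the sketch below sends a small few-shot translation prompt to an open model (the prompt wording and the choice of gpt2 are our own; a small model like gpt2 completes this pattern far less reliably than a large LLM):

```python
from transformers import pipeline

# A few-shot prompt: the example pairs define the pattern,
# the last line asks the model to complete it.
prompt = (
    "English: good morning -> German: guten Morgen\n"
    "English: see you later -> German: bis später\n"
    "English: thank you -> German:"
)

generator = pipeline("text-generation", model="gpt2")
print(generator(prompt, max_new_tokens=5, do_sample=False)[0]["generated_text"])
```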
The model was never explicitly trained to translate "thank you" in this setting. But the Induction Heads recognize the pattern in the example pairs and generalize correctly.
A single-layer model cannot form Induction Heads: one layer is not sufficient for the Two-Layer Circuit. Consequence: no significant ICL is possible.
Minimum network depth is required for ICL. Shallow models have a fundamental limitation.
| Architecture | Layers | IH Possible? | ICL Quality | Use Case |
|---|---|---|---|---|
| Shallow Transformer | 1 Layer | No | None | NLP Toys |
| Standard Transformer | 2-4 Layers | Yes (weak) | Weak | Small Models |
| Modern LLM | 20-80 Layers | Yes (strong) | Strong | Production |
In all of these cases, the model never explicitly learned to generate code or to write German. It simply recognizes the pattern [X][Y]...[X] → [Y] and generalizes.
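The layer counts in the table above can be checked directly from a model's published config; a quick sketch (gpt2 is used here only as an arbitrary example):

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("gpt2")
print(cfg.n_layer)  # 12 layers -> comfortably deep enough for Induction Heads to form
```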
Induction Heads are not a single attention operation, but a composition of two layers: Previous-Token + Pattern Matching.
ICL is not explicitly trained - it's a byproduct of LM Pretraining. Induction Heads emerge spontaneously at around 2.5-5B tokens.
The emergence is not gradual - there is a clear phase change with a recognizable bump in the loss curve.
Single-layer models cannot form IH. At least 2 layers are required for ICL - a fundamental constraint.
The core function: [A][B]...[A] → [B]. A simple but powerful mechanism for most ICL tasks.
Induction Heads are an example of mechanistic interpretability - we can literally see and understand the circuits in the model.