In the original Transformer, 8 parallel attention heads learn different aspects of language – from syntactic structures to semantic relationships.
Multi-Head Attention extends Self-Attention with parallel perspectives. Instead of a single attention mechanism, multiple heads work simultaneously, and each head can specialize in different patterns: syntax, semantics, or coreference.
Building on the Q·Kᵀ calculation (Step 4), multiple attention mechanisms are executed in parallel here. The results then flow into the feedforward layer (Step 6), which stores the model's knowledge.
A single attention head can only capture one type of relationship. Multi-Head Attention enables the model to simultaneously learn syntactic structures, semantic similarities, and coreferences. Modern models use 32-128 heads (GPT-4: 128, Llama 3 70B: 64 query heads with 8 KV heads through Grouped Query Attention).
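To make the Grouped Query Attention idea concrete, here is a minimal NumPy sketch with toy sizes (8 query heads sharing 2 KV heads; these numbers and all variable names are illustrative, not Llama 3's actual configuration): each stored KV head is repeated so that a whole group of query heads attends against the same keys and values.

```python
import numpy as np

# Toy Grouped Query Attention setup (hypothetical sizes, not a real model's):
# 8 query heads share 2 KV heads, so each KV head serves a group of 4 query heads.
n_q_heads, n_kv_heads, head_dim, seq_len = 8, 2, 16, 10
group_size = n_q_heads // n_kv_heads  # query heads per KV head

rng = np.random.default_rng(0)
q = rng.standard_normal((n_q_heads, seq_len, head_dim))
k = rng.standard_normal((n_kv_heads, seq_len, head_dim))
v = rng.standard_normal((n_kv_heads, seq_len, head_dim))

# Repeat each KV head so every query head has matching keys/values,
# while only n_kv_heads KV projections need to be stored (GQA's memory saving).
k_expanded = np.repeat(k, group_size, axis=0)  # (n_q_heads, seq_len, head_dim)
v_expanded = np.repeat(v, group_size, axis=0)

scores = q @ k_expanded.transpose(0, 2, 1) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # softmax over key positions
out = weights @ v_expanded                      # (n_q_heads, seq_len, head_dim)
print(out.shape)                                # (8, 10, 16)
```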
A single attention mechanism would have to capture all aspects of language simultaneously. Multi-Head Attention solves this problem through parallelization: each head can specialize in different relationships – syntactic structures (subject-verb), semantic similarities, coreference (who is "he"?), or simply adjacent tokens. The outputs are concatenated and projected through Wᴼ, so the model can combine all perspectives.
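The following is a minimal sketch of that flow in NumPy (dimensions and weight names such as w_q, w_k, w_v, w_o are illustrative, not taken from any particular model): each head attends independently over the same tokens, the head outputs are concatenated, and the concatenation is projected through Wᴼ back to the model dimension.

```python
import numpy as np

def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads):
    """Minimal multi-head self-attention over x of shape (seq_len, d_model)."""
    seq_len, d_model = x.shape
    head_dim = d_model // n_heads

    # Project the input into queries, keys, and values, then split into heads:
    # (seq_len, d_model) -> (n_heads, seq_len, head_dim)
    def split_heads(t):
        return t.reshape(seq_len, n_heads, head_dim).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)

    # Scaled dot-product attention, computed independently per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)  # (n_heads, seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)          # softmax over key positions
    per_head = weights @ v                                  # (n_heads, seq_len, head_dim)

    # Concatenate the heads and project through the output matrix (Wᴼ)
    # so the model can combine all perspectives into one representation.
    concat = per_head.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o                                     # (seq_len, d_model)

# Toy usage with random weights (real models learn these during training).
rng = np.random.default_rng(0)
d_model, n_heads, seq_len = 64, 8, 5
x = rng.standard_normal((seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.standard_normal((d_model, d_model)) for _ in range(4))
print(multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads).shape)  # (5, 64)
```

Note that the per-head dimension is d_model divided by the number of heads, so running many heads in parallel costs roughly the same as one full-width attention pass.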