[Interactive visualization: attention weights of a single head (e.g. Head 1, position-based) over an example sentence on a 0.0 to 1.0 color scale, together with per-head statistics (entropy, sparsity, maximum and average attention), a short description of what the head focuses on, and a note on its typical function in trained models.]
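The statistics in that panel can be computed directly from a head's attention distribution over the example sentence. Below is a minimal sketch in Python/NumPy; the sparsity definition (fraction of weights below a small threshold) and the threshold value are assumptions, since the visualization does not state its exact convention:

```python
import numpy as np

def head_statistics(attn_row, eps=0.01):
    """Summary statistics for one attention distribution (weights sum to 1).

    The sparsity definition (share of weights below `eps`) and the value of
    `eps` are assumptions; the visualization may use a different convention.
    """
    attn_row = np.asarray(attn_row, dtype=float)
    entropy = -np.sum(attn_row * np.log2(attn_row + 1e-12))  # in bits
    sparsity = np.mean(attn_row < eps)   # fraction of near-zero weights
    return {
        "entropy": entropy,
        "sparsity": sparsity,
        "max_attention": attn_row.max(),
        "avg_attention": attn_row.mean(),
    }

# A sharply focused head: low entropy, one dominant weight.
print(head_statistics([0.9, 0.05, 0.03, 0.02]))
# A uniform head: maximal entropy, zero sparsity.
print(head_statistics([0.25, 0.25, 0.25, 0.25]))
```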
[Diagram: Multi-Head Attention: Concat(head₁, ..., head₈) × WO. The eight head outputs H1–H8 (64 dimensions each) are concatenated into a 512-dimensional vector (8 × 64 = 512) and multiplied by the 512×512 output projection WO, giving a 512-dimensional output.]
💡 Why multiple heads?

A single attention mechanism would have to capture every aspect of language at once. Multi-Head Attention addresses this by running several attention operations in parallel: each head can specialize in a different kind of relationship, such as syntactic structure (subject-verb), semantic similarity, coreference (who is "he"?), or simply adjacent tokens. The head outputs are concatenated and projected through WO, so the model can combine all of these perspectives in a single representation.
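As a concrete illustration of the concatenate-and-project step, here is a minimal sketch in Python/NumPy using the dimensions from the diagram above (8 heads of 64 dimensions each, a 512-dimensional model). The weight matrices are random placeholders standing in for learned parameters, not trained values:

```python
import numpy as np

d_model, n_heads = 512, 8
d_head = d_model // n_heads          # 64 dimensions per head

rng = np.random.default_rng(0)
# Per-head projections and the shared output projection WO
# (random placeholders, not trained parameters).
W_q = rng.normal(size=(n_heads, d_model, d_head)) / np.sqrt(d_model)
W_k = rng.normal(size=(n_heads, d_model, d_head)) / np.sqrt(d_model)
W_v = rng.normal(size=(n_heads, d_model, d_head)) / np.sqrt(d_model)
W_o = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x):
    """x: (seq_len, d_model) -> (seq_len, d_model)."""
    heads = []
    for h in range(n_heads):
        q, k, v = x @ W_q[h], x @ W_k[h], x @ W_v[h]   # (seq_len, 64) each
        scores = q @ k.T / np.sqrt(d_head)             # scaled dot product
        heads.append(softmax(scores) @ v)              # one head's output
    concat = np.concatenate(heads, axis=-1)            # (seq_len, 8*64 = 512)
    return concat @ W_o                                # project back to 512

x = rng.normal(size=(6, d_model))     # a 6-token example sentence
print(multi_head_attention(x).shape)  # (6, 512)
```

In practice the loop over heads is implemented as batched tensor operations, but the concatenation of the eight 64-dimensional head outputs into a 512-dimensional vector and the final multiplication by WO are exactly the step shown in the diagram.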