In the original Transformer, 8 parallel attention heads learn different aspects of language – from syntactic structures to semantic relationships.
Multi-Head Attention extends Self-Attention with parallel perspectives. Instead of a single attention mechanism, multiple heads work simultaneously, and each head can specialize in different patterns: syntax, semantics, or coreference.
Building on the Q·Kᵀ calculation (Step 4), multiple attention mechanisms are executed in parallel here. The results then flow into the feedforward layer (Step 6), which stores the model's knowledge.
A single attention head can only capture one type of relationship. Multi-Head Attention enables the model to simultaneously learn syntactic structures, semantic similarities, and coreferences. Modern models use 32-128 heads (GPT-4: 128, Llama 3 70B: 64 query heads with 8 KV heads through Grouped Query Attention).
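To make the Grouped Query Attention idea concrete, here is a minimal NumPy sketch with toy sizes (8 query heads sharing 2 KV heads; these numbers and all variable names are illustrative, not Llama 3's actual configuration): each stored KV head is repeated so that a whole group of query heads attends against the same keys and values.

```python
import numpy as np

# Toy Grouped Query Attention setup (hypothetical sizes, not a real model's):
# 8 query heads share 2 KV heads, so each KV head serves a group of 4 query heads.
n_q_heads, n_kv_heads, head_dim, seq_len = 8, 2, 16, 10
group_size = n_q_heads // n_kv_heads  # query heads per KV head

rng = np.random.default_rng(0)
q = rng.standard_normal((n_q_heads, seq_len, head_dim))
k = rng.standard_normal((n_kv_heads, seq_len, head_dim))
v = rng.standard_normal((n_kv_heads, seq_len, head_dim))

# Repeat each KV head so every query head has matching keys/values,
# while only n_kv_heads KV projections need to be stored (GQA's memory saving).
k_expanded = np.repeat(k, group_size, axis=0)  # (n_q_heads, seq_len, head_dim)
v_expanded = np.repeat(v, group_size, axis=0)

scores = q @ k_expanded.transpose(0, 2, 1) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # softmax over key positions
out = weights @ v_expanded                      # (n_q_heads, seq_len, head_dim)
print(out.shape)                                # (8, 10, 16)
```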
A single attention mechanism would have to capture all aspects of language simultaneously. Multi-Head Attention solves this problem through parallelization: each head can specialize in different relationships – syntactic structures (subject-verb), semantic similarities, coreference (who is "he"?), or simply adjacent tokens. The outputs are concatenated and projected through Wᴼ, so the model can combine all perspectives.
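The following is a minimal sketch of that flow in NumPy (dimensions and weight names such as w_q, w_k, w_v, w_o are illustrative, not taken from any particular model): each head attends independently over the same tokens, the head outputs are concatenated, and the concatenation is projected through Wᴼ back to the model dimension.

```python
import numpy as np

def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads):
    """Minimal multi-head self-attention over x of shape (seq_len, d_model)."""
    seq_len, d_model = x.shape
    head_dim = d_model // n_heads

    # Project the input into queries, keys, and values, then split into heads:
    # (seq_len, d_model) -> (n_heads, seq_len, head_dim)
    def split_heads(t):
        return t.reshape(seq_len, n_heads, head_dim).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)

    # Scaled dot-product attention, computed independently per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)  # (n_heads, seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)          # softmax over key positions
    per_head = weights @ v                                  # (n_heads, seq_len, head_dim)

    # Concatenate the heads and project through the output matrix (Wᴼ)
    # so the model can combine all perspectives into one representation.
    concat = per_head.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o                                     # (seq_len, d_model)

# Toy usage with random weights (real models learn these during training).
rng = np.random.default_rng(0)
d_model, n_heads, seq_len = 64, 8, 5
x = rng.standard_normal((seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.standard_normal((d_model, d_model)) for _ in range(4))
print(multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads).shape)  # (5, 64)
```

Note that the per-head dimension is d_model divided by the number of heads, so running many heads in parallel costs roughly the same as one full-width attention pass.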