(Figure: multi-head attention illustrated with 4 heads, head dimension 64)
Step 1: Individual Head Outputs
Each of the h heads performs its own attention operation and produces a d_k-dimensional output vector for every token, i.e. an n × d_k matrix per head.
Multi-Head Attention Formula:

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O

Dimensions:
head_i ∈ ℝ^(n × d_k) for each head
Concat(...) ∈ ℝ^(n × h·d_k) = ℝ^(n × d_model)
W^O ∈ ℝ^(h·d_k × d_model), here ℝ^(d_model × d_model) since h·d_k = d_model
Output ∈ ℝ^(n × d_model)
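To make the shapes concrete, here is a minimal NumPy sketch of the formula above for the self-attention case (Q = K = V = X); the sizes n = 10, d_model = 512, h = 8 and the random weight matrices are illustrative assumptions, not values taken from the text:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (n, n)
    return softmax(scores) @ V                 # (n, d_k)

rng = np.random.default_rng(0)
n, d_model, h = 10, 512, 8                     # illustrative sizes
d_k = d_model // h                             # 64 per head

X = rng.standard_normal((n, d_model))          # token representations (Q = K = V = X)
# One projection triple (W_i^Q, W_i^K, W_i^V) per head, plus the output projection W^O
W_Q = rng.standard_normal((h, d_model, d_k)) / np.sqrt(d_model)
W_K = rng.standard_normal((h, d_model, d_k)) / np.sqrt(d_model)
W_V = rng.standard_normal((h, d_model, d_k)) / np.sqrt(d_model)
W_O = rng.standard_normal((h * d_k, d_model)) / np.sqrt(h * d_k)

# head_i = Attention(X W_i^Q, X W_i^K, X W_i^V), each of shape (n, d_k)
heads = [attention(X @ W_Q[i], X @ W_K[i], X @ W_V[i]) for i in range(h)]

concat = np.concatenate(heads, axis=-1)        # (n, h*d_k) = (n, d_model)
output = concat @ W_O                          # MultiHead(Q, K, V), (n, d_model)

assert concat.shape == (n, h * d_k) == (n, d_model)
assert output.shape == (n, d_model)
```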
Why Concatenation?
Each head can learn to focus on different aspects of the input (e.g. syntax, semantics, positional relationships). Concatenation combines all of these perspectives, and W^O then merges them into a single final representation.
Dimension Preservation
The heads are usually sized so that h · d_k = d_model, e.g. 8 heads × 64 = 512, so the concatenation already has the model dimension. If h · d_k ≠ d_model, the output projection W^O maps the concatenated result back to d_model, as sketched below.
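A small sketch of the shape bookkeeping for that second case, using hypothetical sizes with h·d_k ≠ d_model to show how W^O restores the model dimension:

```python
import numpy as np

n, d_model = 10, 512
h, d_k = 8, 32                        # h * d_k = 256, deliberately != d_model
concat = np.zeros((n, h * d_k))       # concatenated head outputs, shape (n, 256)
W_O = np.zeros((h * d_k, d_model))    # projection maps 256 -> 512
assert (concat @ W_O).shape == (n, d_model)
```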
Output Projection
W^O is a trainable matrix that linearly transforms the concatenated result, which lets the outputs of the different heads interact.
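One way to see this interaction: concatenating the heads and multiplying by W^O is the same as giving each head its own block of rows of W^O and summing the per-head projections, so every output feature mixes contributions from all heads. A small sketch with made-up toy sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
n, h, d_k, d_model = 4, 2, 3, 6                     # toy sizes for illustration
heads = [rng.standard_normal((n, d_k)) for _ in range(h)]
W_O = rng.standard_normal((h * d_k, d_model))

# Concatenation followed by the output projection ...
out_concat = np.concatenate(heads, axis=-1) @ W_O
# ... equals the sum of each head projected by its own block of rows of W_O,
# so each output feature receives contributions from every head.
out_sum = sum(heads[i] @ W_O[i * d_k:(i + 1) * d_k] for i in range(h))

assert np.allclose(out_concat, out_sum)
```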
Parallel Processing
All heads can be computed in parallel (GPU-optimized). Concatenation is a simple reshape operation without additional computations.
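In practice the per-head loop can be replaced by one batched computation: project Q, K, V once with full d_model × d_model matrices, reshape into an (h, n, d_k) tensor, and run all heads with batched matrix multiplications; the final "concatenation" is then just the inverse reshape. A minimal NumPy sketch (sizes and random weights are illustrative assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(2)
n, d_model, h = 10, 512, 8
d_k = d_model // h

X = rng.standard_normal((n, d_model))
# One large projection per Q/K/V instead of h separate per-head matrices
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                 for _ in range(3))

def split_heads(M):
    # (n, d_model) -> (h, n, d_k): split the feature axis into heads
    return M.reshape(n, h, d_k).transpose(1, 0, 2)

Q, K, V = split_heads(X @ W_Q), split_heads(X @ W_K), split_heads(X @ W_V)

scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (h, n, n): all heads at once
heads = softmax(scores) @ V                        # (h, n, d_k)

# "Concatenation" is just the inverse reshape back to (n, h*d_k) = (n, d_model)
concat = heads.transpose(1, 0, 2).reshape(n, h * d_k)
assert concat.shape == (n, d_model)
```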