(Figure: multi-head attention illustrated with 4 heads, head dimension 64)
Step 1: Individual Head Outputs
Each of the h heads performs its own attention operation and produces a d_k-dimensional output vector for every token, i.e. an n × d_k matrix per head.
Multi-Head Attention Formula:

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O

Dimensions:
head_i ∈ ℝ^(n × d_k) for each head
Concat(...) ∈ ℝ^(n × h·d_k) = ℝ^(n × d_model)
W^O ∈ ℝ^(h·d_k × d_model), here ℝ^(d_model × d_model) since h·d_k = d_model
Output ∈ ℝ^(n × d_model)
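To make the shapes concrete, here is a minimal NumPy sketch of the formula above for the self-attention case (Q = K = V = X); the sizes n = 10, d_model = 512, h = 8 and the random weight matrices are illustrative assumptions, not values taken from the text:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (n, n)
    return softmax(scores) @ V                 # (n, d_k)

rng = np.random.default_rng(0)
n, d_model, h = 10, 512, 8                     # illustrative sizes
d_k = d_model // h                             # 64 per head

X = rng.standard_normal((n, d_model))          # token representations (Q = K = V = X)
# One projection triple (W_i^Q, W_i^K, W_i^V) per head, plus the output projection W^O
W_Q = rng.standard_normal((h, d_model, d_k)) / np.sqrt(d_model)
W_K = rng.standard_normal((h, d_model, d_k)) / np.sqrt(d_model)
W_V = rng.standard_normal((h, d_model, d_k)) / np.sqrt(d_model)
W_O = rng.standard_normal((h * d_k, d_model)) / np.sqrt(h * d_k)

# head_i = Attention(X W_i^Q, X W_i^K, X W_i^V), each of shape (n, d_k)
heads = [attention(X @ W_Q[i], X @ W_K[i], X @ W_V[i]) for i in range(h)]

concat = np.concatenate(heads, axis=-1)        # (n, h*d_k) = (n, d_model)
output = concat @ W_O                          # MultiHead(Q, K, V), (n, d_model)

assert concat.shape == (n, h * d_k) == (n, d_model)
assert output.shape == (n, d_model)
```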
Why Concatenation?
Each head can learn to focus on different aspects of the input (e.g. syntax, semantics, positional relationships). Concatenation combines all of these perspectives, and W^O then merges them into a single final representation.
Dimension Preservation
The heads are usually sized so that h · d_k = d_model, e.g. 8 heads × 64 = 512, so the concatenation already has the model dimension. If h · d_k ≠ d_model, the output projection W^O maps the concatenated result back to d_model, as sketched below.
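A small sketch of the shape bookkeeping for that second case, using hypothetical sizes with h·d_k ≠ d_model to show how W^O restores the model dimension:

```python
import numpy as np

n, d_model = 10, 512
h, d_k = 8, 32                        # h * d_k = 256, deliberately != d_model
concat = np.zeros((n, h * d_k))       # concatenated head outputs, shape (n, 256)
W_O = np.zeros((h * d_k, d_model))    # projection maps 256 -> 512
assert (concat @ W_O).shape == (n, d_model)
```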
Output Projection
W^O is a trainable matrix that linearly transforms the concatenated result, which lets the outputs of the different heads interact.
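One way to see this interaction: concatenating the heads and multiplying by W^O is the same as giving each head its own block of rows of W^O and summing the per-head projections, so every output feature mixes contributions from all heads. A small sketch with made-up toy sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
n, h, d_k, d_model = 4, 2, 3, 6                     # toy sizes for illustration
heads = [rng.standard_normal((n, d_k)) for _ in range(h)]
W_O = rng.standard_normal((h * d_k, d_model))

# Concatenation followed by the output projection ...
out_concat = np.concatenate(heads, axis=-1) @ W_O
# ... equals the sum of each head projected by its own block of rows of W_O,
# so each output feature receives contributions from every head.
out_sum = sum(heads[i] @ W_O[i * d_k:(i + 1) * d_k] for i in range(h))

assert np.allclose(out_concat, out_sum)
```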
Parallel Processing
All heads can be computed in parallel (GPU-optimized). Concatenation is a simple reshape operation without additional computations.
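In practice the per-head loop can be replaced by one batched computation: project Q, K, V once with full d_model × d_model matrices, reshape into an (h, n, d_k) tensor, and run all heads with batched matrix multiplications; the final "concatenation" is then just the inverse reshape. A minimal NumPy sketch (sizes and random weights are illustrative assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(2)
n, d_model, h = 10, 512, 8
d_k = d_model // h

X = rng.standard_normal((n, d_model))
# One large projection per Q/K/V instead of h separate per-head matrices
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                 for _ in range(3))

def split_heads(M):
    # (n, d_model) -> (h, n, d_k): split the feature axis into heads
    return M.reshape(n, h, d_k).transpose(1, 0, 2)

Q, K, V = split_heads(X @ W_Q), split_heads(X @ W_K), split_heads(X @ W_V)

scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (h, n, n): all heads at once
heads = softmax(scores) @ V                        # (h, n, d_k)

# "Concatenation" is just the inverse reshape back to (n, h*d_k) = (n, d_model)
concat = heads.transpose(1, 0, 2).reshape(n, h * d_k)
assert concat.shape == (n, d_model)
```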