How the outputs of multiple attention heads are combined and transformed through the output projection WO.
Head Concatenation, followed by the output projection WO, forms the final step of Multi-Head Attention: the parallel perspectives of the individual heads are combined into a single vector. This step allows the model to draw on syntactic, semantic, and positional information at the same time.
After all attention heads have been computed in parallel, their outputs are merged in this step. The WO matrix enables one final interaction between the heads' perspectives before the result is passed on to the feed-forward network.
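A minimal PyTorch sketch of this step is shown below; the dimensions (8 heads, model width 512) and the random per-head outputs are illustrative assumptions, not values from a specific model. The per-head outputs are concatenated along the feature dimension and then mixed by the trainable WO projection.

```python
import torch
import torch.nn as nn

# Illustrative dimensions (assumptions for the example)
num_heads = 8
d_model = 512
d_head = d_model // num_heads  # 64 per head
batch, seq_len = 2, 10

# Stand-ins for the per-head attention outputs, each of shape (batch, seq_len, d_head)
head_outputs = [torch.randn(batch, seq_len, d_head) for _ in range(num_heads)]

# 1. Concatenation: stitch the parallel perspectives back into one vector per token
concat = torch.cat(head_outputs, dim=-1)  # (batch, seq_len, d_model)

# 2. Output projection WO: a trainable linear map that mixes information across heads
W_O = nn.Linear(d_model, d_model, bias=False)
output = W_O(concat)  # (batch, seq_len, d_model), handed on to the feed-forward network
```

Note the role of the two sub-steps: concatenation alone only places the head outputs side by side, while the WO projection lets every output dimension draw on information from all heads.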
Without concatenation and the subsequent projection, the head outputs would remain isolated from one another. The WO matrix is trainable and learns which combinations of head outputs are useful for different tasks. With 32-128 heads in modern models, this integration step is crucial for overall performance; a small sizing example follows below.
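To make the scale concrete (the numbers here are illustrative assumptions, not taken from a particular model): because the heads split the model width among themselves, the concatenated vector is back to d_model, so WO is always d_model x d_model regardless of how many heads are used.

```python
# Illustrative parameter count for WO (dimensions are assumptions for the example)
d_model = 4096           # model width
num_heads = 32           # heads split this width: 4096 / 32 = 128 per head
d_head = d_model // num_heads

# After concatenation the vector has size d_model again, so WO is d_model x d_model
wo_params = d_model * d_model
print(f"WO shape: ({d_model}, {d_model}) -> {wo_params:,} parameters")  # 16,777,216
```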