Step by step: How Query and Key vectors become attention scores. Each dot product measures the "compatibility" between two tokens.
Self-Attention is the mathematical heart of the Transformer. The Q·Kᵀ multiplication computes a score for each token pair, determining how much the tokens should "attend" to each other.
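For reference, the complete Self-Attention formula combines these pieces with the Value matrix V (the third projection alongside Q and K):

Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k) · V

Q and K are n×d_k matrices with one row per token, and V is an n×d_v matrix of Value vectors; the steps below focus on the Q·Kᵀ scores and the softmax.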
After Tokenization (1), Embedding (2), and Position Encoding (3), we have vectors that encode both meaning and position. Now Self-Attention calculates which tokens should interact with each other. This mechanism is then extended to Multi-Head Attention (Step 5).
The dot product Q·Kᵀ produces an n×n attention matrix, which is the source of the quadratic O(n²) complexity. In return, this cost buys direct connections between arbitrarily distant tokens, something RNNs can only approximate through long chains of sequential steps. Scaling by √d_k keeps the scores from growing with the dimension, which would otherwise push the softmax into saturation and cause vanishing gradients.
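A minimal NumPy sketch of this step (the shapes, random seed, and variable names are illustrative assumptions, not taken from the text): it builds example Q and K matrices, computes the raw n×n score matrix, and applies the √d_k scaling.

```python
import numpy as np

n, d_k = 4, 64                       # example sequence length and Query/Key dimension
rng = np.random.default_rng(0)

Q = rng.standard_normal((n, d_k))    # one Query vector per token
K = rng.standard_normal((n, d_k))    # one Key vector per token

# Raw compatibility scores: Score[i, j] = Q[i] . K[j]
scores = Q @ K.T                     # shape (n, n): the source of the O(n^2) cost

# Scale by sqrt(d_k) so the scores don't grow with the dimension
# and push the softmax into saturation
scaled_scores = scores / np.sqrt(d_k)
print(scaled_scores.shape)           # (4, 4)
```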
Each cell in the Attention Score Matrix is the dot product of a Query row with a Key column (a row of Q with a column of Kᵀ). The value Score[i,j] measures how much Token i should "attend" to Token j; high values mean high relevance. After softmax, these scores become normalized weights that determine how much information flows from each token into the output.
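To make that last step concrete, here is a short, self-contained NumPy sketch (the names and the random stand-in for the scaled scores are illustrative assumptions): a row-wise softmax turns each row of scores into weights that sum to 1, and multiplying by the Value matrix V mixes information according to those weights.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_k = 4, 64
scaled_scores = rng.standard_normal((n, n))   # stand-in for Q @ K.T / sqrt(d_k) from the sketch above
V = rng.standard_normal((n, d_k))             # one Value vector per token

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

weights = softmax(scaled_scores)              # shape (n, n); each row sums to 1
output = weights @ V                          # each token's output is a weighted mix of Value vectors
print(weights.sum(axis=1))                    # [1. 1. 1. 1.]
```

Row i of `weights` is exactly the normalized version of Score[i, :], so `output[i]` gathers information from every token in proportion to how strongly Token i attends to it.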