Step by step: How Query and Key vectors become attention scores. Each dot product measures the "compatibility" between two tokens.
Self-Attention is the mathematical heart of the Transformer. The Q·Kᵀ multiplication computes a score for each token pair, determining how much the tokens should "attend" to each other.
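For reference, the complete Self-Attention formula combines these pieces with the Value matrix V (the third projection alongside Q and K):

Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k) · V

Q and K are n×d_k matrices with one row per token, and V is an n×d_v matrix of Value vectors; the steps below focus on the Q·Kᵀ scores and the softmax.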
After Tokenization (1), Embedding (2), and Position Encoding (3), we have vectors that encode both meaning and position. Now Self-Attention calculates which tokens should interact with each other. This mechanism is then extended to Multi-Head Attention (Step 5).
The dot product Q·Kᵀ produces an n×n attention matrix, which is the source of the quadratic O(n²) complexity. In return, this cost buys direct connections between arbitrarily distant tokens, something RNNs can only approximate through long chains of sequential steps. Scaling by √d_k keeps the scores from growing with the dimension, which would otherwise push the softmax into saturation and cause vanishing gradients.
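A minimal NumPy sketch of this step (the shapes, random seed, and variable names are illustrative assumptions, not taken from the text): it builds example Q and K matrices, computes the raw n×n score matrix, and applies the √d_k scaling.

```python
import numpy as np

n, d_k = 4, 64                       # example sequence length and Query/Key dimension
rng = np.random.default_rng(0)

Q = rng.standard_normal((n, d_k))    # one Query vector per token
K = rng.standard_normal((n, d_k))    # one Key vector per token

# Raw compatibility scores: Score[i, j] = Q[i] . K[j]
scores = Q @ K.T                     # shape (n, n): the source of the O(n^2) cost

# Scale by sqrt(d_k) so the scores don't grow with the dimension
# and push the softmax into saturation
scaled_scores = scores / np.sqrt(d_k)
print(scaled_scores.shape)           # (4, 4)
```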
Each cell in the Attention Score Matrix is the dot product of a Query row with a Key column (a row of Q with a column of Kᵀ). The value Score[i,j] measures how much Token i should "attend" to Token j; high values mean high relevance. After softmax, these scores become normalized weights that determine how much information flows from each token into the output.
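To make that last step concrete, here is a short, self-contained NumPy sketch (the names and the random stand-in for the scaled scores are illustrative assumptions): a row-wise softmax turns each row of scores into weights that sum to 1, and multiplying by the Value matrix V mixes information according to those weights.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_k = 4, 64
scaled_scores = rng.standard_normal((n, n))   # stand-in for Q @ K.T / sqrt(d_k) from the sketch above
V = rng.standard_normal((n, d_k))             # one Value vector per token

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

weights = softmax(scaled_scores)              # shape (n, n); each row sums to 1
output = weights @ V                          # each token's output is a weighted mix of Value vectors
print(weights.sum(axis=1))                    # [1. 1. 1. 1.]
```

Row i of `weights` is exactly the normalized version of Score[i, :], so `output[i]` gathers information from every token in proportion to how strongly Token i attends to it.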