⚙️ Interactive Controls
Query position: 4 (word: "Model") · Temperature τ = 1.0 · 8 positions.
Document: [0] The | [1] modern | [2] artificial | [3] intelligence | [4] Model | [5] needs | [6] many | [7] parameters
Attention heatmap (Position × Document) with color legend: high attention (1.0), medium attention (0.5), low (0.0).
Live metrics: Entropy, Max Attention, Focus Width, Head Type (currently: Syntax).
Fig. 1 | Interactive attention heatmap: X-axis = document positions, Y-axis = query position. Light purple = high attention weight (the model is paying attention to that token). Dark = low (ignored). The temperature controls the sharpness of the distribution. Different heads specialize: Head 0 attends to syntax, Head 1 to semantics.
🎯 Attention as Focus Mechanism
Bright colors show what the model is "attending to". The query at position 4 ("Model") can look at different tokens, and the resulting distribution assigns high weights to the most important context.
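What the heatmap visualizes for one row can be sketched as scaled dot-product attention. This is a minimal, illustrative version: the embeddings and the W_q / W_k projection matrices below are random stand-ins, not the weights behind the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16

tokens = ["The", "modern", "artificial", "intelligence", "Model",
          "needs", "many", "parameters"]
X = rng.normal(size=(len(tokens), d_model))      # toy token embeddings (random stand-ins)

W_q = rng.normal(size=(d_model, d_model))        # hypothetical query projection
W_k = rng.normal(size=(d_model, d_model))        # hypothetical key projection

Q = X @ W_q                                      # one query vector per token
K = X @ W_k                                      # one key vector per token

query_pos = 4                                    # the word "Model"
scores = Q[query_pos] @ K.T / np.sqrt(d_model)   # similarity of "Model" to every token
weights = np.exp(scores - scores.max())          # numerically stable softmax
weights /= weights.sum()                         # attention weights, one per document position

for tok, w in zip(tokens, weights):
    print(f"{tok:>12}: {w:.3f}")
```

Each printed weight corresponds to one cell in the heatmap row for query position 4.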
🌡️ Temperature Regulates Sharpness
τ = 0.1 → very sharp (almost all weight on a single token). τ = 1.0 → balanced. τ = 2.0 → diffuse (many tokens weighted almost equally). A higher temperature means a wider window for "context understanding".
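A small sketch of the temperature effect: the scores are divided by τ before the softmax. The helper name and the score vector are made up for illustration.

```python
import numpy as np

def softmax_with_temperature(scores, tau=1.0):
    """Softmax where scores are divided by the temperature tau first."""
    z = np.asarray(scores, dtype=float) / tau
    z -= z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = [2.0, 1.0, 0.5, 3.0, 0.2]    # illustrative raw attention scores
for tau in (0.1, 1.0, 2.0):
    print(f"tau = {tau}:", np.round(softmax_with_temperature(scores, tau), 3))
```

At τ = 0.1 nearly all mass lands on the highest score; at τ = 2.0 the weights spread out across many tokens.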
🧠 Multi-Head Specialization
Head 0 (syntax) attends to grammatical structure, often nearby words. Head 1 (semantics) attends to semantically related words, which can be far away. Together, the heads produce a more complete understanding.
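A sketch of the mechanism, not of trained behavior: each head has its own query/key projections and therefore computes its own attention pattern over the same input. The random projections below will not reproduce the syntax/semantics split, but they show how the heads operate independently.

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_model, n_heads = 8, 16, 2
d_head = d_model // n_heads

X = rng.normal(size=(seq_len, d_model))          # toy token embeddings

def head_attention(X, W_q, W_k):
    """Full (seq_len x seq_len) attention matrix for one head."""
    Q, K = X @ W_q, X @ W_k
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

for h in range(n_heads):
    # Each head gets its own (here: random, illustrative) projections,
    # so each produces a different pattern from the same input.
    W_q = rng.normal(size=(d_model, d_head))
    W_k = rng.normal(size=(d_model, d_head))
    A = head_attention(X, W_q, W_k)
    print(f"Head {h}, weights from query position 4:", np.round(A[4], 2))
```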
📊 Entropy Measures Uncertainty
High entropy = many possibilities (the model is uncertain). Low entropy = a clear focus (the model is confident). Good models show high entropy in ambiguous contexts and low entropy in clear cases.
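The entropy metric shown above the heatmap can be computed as Shannon entropy of the attention weights. The helper name and the two example distributions below are illustrative only.

```python
import numpy as np

def attention_entropy(weights):
    """Shannon entropy of an attention distribution, in bits."""
    w = np.asarray(weights, dtype=float)
    w = w[w > 0]                       # 0 * log(0) contributes nothing
    return float(-(w * np.log2(w)).sum())

focused = [0.90, 0.05, 0.03, 0.02]     # clear focus on one token
diffuse = [0.25, 0.25, 0.25, 0.25]     # uniform over four tokens

print("focused:", round(attention_entropy(focused), 3))   # close to 0 bits
print("diffuse:", round(attention_entropy(diffuse), 3))   # log2(4) = 2 bits
```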
🔍 Window Mechanics (Causal Masking)
Auto-regressive models can only look backwards (causal mask). A query at position 4 sees only positions [0, 1, 2, 3, 4], not [5, 6, 7]. This mechanism prevents the model from "cheating" by seeing the future.
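A minimal sketch of the masking step: future positions are set to minus infinity before the softmax, so their attention weight becomes exactly zero. The score matrix here is a random placeholder.

```python
import numpy as np

rng = np.random.default_rng(2)
seq_len = 8
scores = rng.normal(size=(seq_len, seq_len))      # placeholder attention scores

# Causal mask: True strictly above the diagonal (i.e. all future positions).
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)

e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)

print(np.round(weights[4], 2))   # positions 5-7 are exactly 0.0 for query 4
```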
⚖️ Softmax Normalization Guarantees Sum = 1
All weights together sum to 1.0, forming a probability distribution. Toggle normalization on and off to see the difference: without normalization, the raw (exponentiated) scores can exceed 1 and their sum is arbitrary; with normalization, you get a reliable, interpretable distribution.
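A short illustration of that toggle, with a made-up score vector: the raw exponentiated scores are unbounded, while the softmax-normalized weights always sum to exactly 1.

```python
import numpy as np

scores = np.array([2.0, 1.0, 3.0, 0.5])   # illustrative raw attention scores

raw = np.exp(scores)            # unnormalized: values can exceed 1, sum is arbitrary
norm = raw / raw.sum()          # softmax: values in (0, 1), sum is exactly 1

print("raw:     ", np.round(raw, 2),  "sum =", round(raw.sum(), 2))
print("softmax: ", np.round(norm, 2), "sum =", round(norm.sum(), 2))
```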