⚙️ Interactive Controls
Query position: 4 (word: "Model") · Temperature τ = 1.0 · 8 positions.
Document: [0] The | [1] modern | [2] artificial | [3] intelligence | [4] Model | [5] needs | [6] many | [7] parameters
Attention heatmap (Position × Document) with color legend: high attention (1.0), medium attention (0.5), low (0.0).
Live metrics: Entropy, Max Attention, Focus Width, Head Type (currently: Syntax).
Fig. 1 | Interactive attention heatmap: X-axis = document positions, Y-axis = query position. Light purple = high attention weight (the model is paying attention to that token). Dark = low (ignored). The temperature controls the sharpness of the distribution. Different heads specialize: Head 0 attends to syntax, Head 1 to semantics.
🎯 Attention as Focus Mechanism
Bright colors show what the model is "attending to". The query at position 4 ("Model") can look at different tokens, and the resulting distribution assigns high weights to the most important context.
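What the heatmap visualizes for one row can be sketched as scaled dot-product attention. This is a minimal, illustrative version: the embeddings and the W_q / W_k projection matrices below are random stand-ins, not the weights behind the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16

tokens = ["The", "modern", "artificial", "intelligence", "Model",
          "needs", "many", "parameters"]
X = rng.normal(size=(len(tokens), d_model))      # toy token embeddings (random stand-ins)

W_q = rng.normal(size=(d_model, d_model))        # hypothetical query projection
W_k = rng.normal(size=(d_model, d_model))        # hypothetical key projection

Q = X @ W_q                                      # one query vector per token
K = X @ W_k                                      # one key vector per token

query_pos = 4                                    # the word "Model"
scores = Q[query_pos] @ K.T / np.sqrt(d_model)   # similarity of "Model" to every token
weights = np.exp(scores - scores.max())          # numerically stable softmax
weights /= weights.sum()                         # attention weights, one per document position

for tok, w in zip(tokens, weights):
    print(f"{tok:>12}: {w:.3f}")
```

Each printed weight corresponds to one cell in the heatmap row for query position 4.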
🌡️ Temperature Regulates Sharpness
τ = 0.1 → very sharp (almost all weight on a single token). τ = 1.0 → balanced. τ = 2.0 → diffuse (many tokens weighted almost equally). A higher temperature means a wider window for "context understanding".
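A small sketch of the temperature effect: the scores are divided by τ before the softmax. The helper name and the score vector are made up for illustration.

```python
import numpy as np

def softmax_with_temperature(scores, tau=1.0):
    """Softmax where scores are divided by the temperature tau first."""
    z = np.asarray(scores, dtype=float) / tau
    z -= z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = [2.0, 1.0, 0.5, 3.0, 0.2]    # illustrative raw attention scores
for tau in (0.1, 1.0, 2.0):
    print(f"tau = {tau}:", np.round(softmax_with_temperature(scores, tau), 3))
```

At τ = 0.1 nearly all mass lands on the highest score; at τ = 2.0 the weights spread out across many tokens.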
🧠 Multi-Head Specialization
Head 0 (syntax) attends to grammatical structure, often nearby words. Head 1 (semantics) attends to semantically related words, which can be far away. Together, the heads produce a more complete understanding.
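A sketch of the mechanism, not of trained behavior: each head has its own query/key projections and therefore computes its own attention pattern over the same input. The random projections below will not reproduce the syntax/semantics split, but they show how the heads operate independently.

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_model, n_heads = 8, 16, 2
d_head = d_model // n_heads

X = rng.normal(size=(seq_len, d_model))          # toy token embeddings

def head_attention(X, W_q, W_k):
    """Full (seq_len x seq_len) attention matrix for one head."""
    Q, K = X @ W_q, X @ W_k
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

for h in range(n_heads):
    # Each head gets its own (here: random, illustrative) projections,
    # so each produces a different pattern from the same input.
    W_q = rng.normal(size=(d_model, d_head))
    W_k = rng.normal(size=(d_model, d_head))
    A = head_attention(X, W_q, W_k)
    print(f"Head {h}, weights from query position 4:", np.round(A[4], 2))
```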
📊 Entropy Measures Uncertainty
High entropy = many possibilities (the model is uncertain). Low entropy = a clear focus (the model is confident). Good models show high entropy in ambiguous contexts and low entropy in clear cases.
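The entropy metric shown above the heatmap can be computed as Shannon entropy of the attention weights. The helper name and the two example distributions below are illustrative only.

```python
import numpy as np

def attention_entropy(weights):
    """Shannon entropy of an attention distribution, in bits."""
    w = np.asarray(weights, dtype=float)
    w = w[w > 0]                       # 0 * log(0) contributes nothing
    return float(-(w * np.log2(w)).sum())

focused = [0.90, 0.05, 0.03, 0.02]     # clear focus on one token
diffuse = [0.25, 0.25, 0.25, 0.25]     # uniform over four tokens

print("focused:", round(attention_entropy(focused), 3))   # close to 0 bits
print("diffuse:", round(attention_entropy(diffuse), 3))   # log2(4) = 2 bits
```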
🔍 Window Mechanics (Causal Masking)
Auto-regressive models can only look backwards (causal mask). A query at position 4 sees only positions [0, 1, 2, 3, 4], not [5, 6, 7]. This mechanism prevents the model from "cheating" by seeing the future.
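A minimal sketch of the masking step: future positions are set to minus infinity before the softmax, so their attention weight becomes exactly zero. The score matrix here is a random placeholder.

```python
import numpy as np

rng = np.random.default_rng(2)
seq_len = 8
scores = rng.normal(size=(seq_len, seq_len))      # placeholder attention scores

# Causal mask: True strictly above the diagonal (i.e. all future positions).
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)

e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)

print(np.round(weights[4], 2))   # positions 5-7 are exactly 0.0 for query 4
```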
⚖️ Softmax Normalization Guarantees Sum = 1
All weights together sum to 1.0, forming a probability distribution. Toggle normalization on and off to see the difference: without normalization, the raw (exponentiated) scores can exceed 1 and their sum is arbitrary; with normalization, you get a reliable, interpretable distribution.
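A short illustration of that toggle, with a made-up score vector: the raw exponentiated scores are unbounded, while the softmax-normalized weights always sum to exactly 1.

```python
import numpy as np

scores = np.array([2.0, 1.0, 3.0, 0.5])   # illustrative raw attention scores

raw = np.exp(scores)            # unnormalized: values can exceed 1, sum is arbitrary
norm = raw / raw.sum()          # softmax: values in (0, 1), sum is exactly 1

print("raw:     ", np.round(raw, 2),  "sum =", round(raw.sum(), 2))
print("softmax: ", np.round(norm, 2), "sum =", round(norm.sum(), 2))
```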