Sinusoidal Positional Encoding

Each position receives a unique vector built from sine and cosine functions of different frequencies. Low dimensions oscillate quickly (high frequency), while high dimensions oscillate slowly (low frequency). This lets the model distinguish positions at both fine (local) and coarse (global) scales.

Fig. 1 | Position Encoding Matrix (Position × Dimension). Each row corresponds to a token position, each column to an embedding dimension; even dimensions use sine, odd dimensions cosine. Color scale: dark blue (-1.0), yellow (0.0), dark red (+1.0). Note the periodic stripes: low dimensions (left) oscillate quickly, high dimensions (right) slowly.

Formula

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

pos: Position of the token (0 to n-1)
i: Index of the sine/cosine dimension pair (0 to d_model/2 - 1)
d_model: Embedding dimension (e.g., 512)
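
As a concrete illustration, here is a minimal NumPy sketch of the formula; the function name sinusoidal_position_encoding and the example sizes (64 positions, 256 dimensions) are arbitrary choices for this sketch.

```python
# Minimal NumPy sketch of the formula above; names and sizes are illustrative.
import numpy as np

def sinusoidal_position_encoding(n: int, d_model: int) -> np.ndarray:
    """Return an (n, d_model) matrix of sinusoidal position encodings."""
    pos = np.arange(n)[:, np.newaxis]                  # positions 0..n-1, shape (n, 1)
    i = np.arange(d_model // 2)[np.newaxis, :]         # dimension-pair index, shape (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)  # pos / 10000^(2i/d_model)
    pe = np.zeros((n, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions: cosine
    return pe

pe = sinusoidal_position_encoding(64, 256)
print(pe.shape)    # (64, 256)
print(pe[0, :4])   # position 0 -> [0. 1. 0. 1.]
```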

Why this formula?

  • Unique encoding for each position
  • Different frequencies for different dimensions
  • Low dimensions: High frequency (fast change)
  • High dimensions: Low frequency (slow change)
  • Enables generalization to longer sequences (checked in the sketch after this list)
  • No trainable parameters
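
The listed properties can be checked directly. The sketch below assumes the sinusoidal_position_encoding helper from the previous sketch; the sequence lengths are arbitrary examples.

```python
# Small sanity check of the properties listed above; lengths are arbitrary examples.
import numpy as np

pe_short = sinusoidal_position_encoding(64, 256)     # e.g. the length seen during training
pe_long  = sinusoidal_position_encoding(4096, 256)   # a much longer sequence

# Unique encoding: no two positions share the same vector.
assert np.unique(pe_short.round(6), axis=0).shape[0] == 64

# Generalization: the first 64 rows are identical, and longer sequences need
# no retraining because the encoding has no trainable parameters.
assert np.allclose(pe_short, pe_long[:64])
```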

Patterns in the Heatmap

Stripes within each column: Each dimension (column) oscillates periodically as the position increases, reflecting the sine/cosine functions. Low dimensions (left) produce narrow stripes (high frequency), high dimensions (right) wide stripes (low frequency).

Variation across rows: Each row is one position's unique fingerprint across all dimensions; comparing rows shows how the encoding changes from position to position.
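
These patterns can be reproduced with a quick plot. The sketch below again assumes the sinusoidal_position_encoding helper and matplotlib; the colormap is just one choice that roughly matches the color scale of Fig. 1.

```python
# Reproduce the heatmap of Fig. 1; assumes matplotlib and the helper defined earlier.
import matplotlib.pyplot as plt

pe = sinusoidal_position_encoding(64, 256)

plt.figure(figsize=(8, 4))
plt.imshow(pe, aspect="auto", cmap="RdYlBu_r", vmin=-1.0, vmax=1.0)  # blue -1, yellow 0, red +1
plt.xlabel("Embedding dimension")
plt.ylabel("Token position")
plt.colorbar(label="PE value")
plt.title("Sinusoidal position encoding")
plt.tight_layout()
plt.show()
```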

Usage in Practice

Position Encoding is added to token embeddings:

Input = Token_Embedding + Position_Encoding
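
A minimal PyTorch sketch of this addition, reusing the sinusoidal_position_encoding helper from above; the vocabulary size, sequence length, and embedding size are arbitrary example values.

```python
# Sketch of adding position encodings to token embeddings (PyTorch); sizes are examples.
import torch
import torch.nn as nn

vocab_size, seq_len, d_model = 10_000, 64, 256
token_ids = torch.randint(0, vocab_size, (1, seq_len))           # one example batch

token_embedding = nn.Embedding(vocab_size, d_model)
pe = torch.from_numpy(sinusoidal_position_encoding(seq_len, d_model)).float()

# Input = Token_Embedding + Position_Encoding (pe broadcasts over the batch dimension)
x = token_embedding(token_ids) + pe                               # shape (1, seq_len, d_model)
```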

The original Transformer (Vaswani et al., 2017) uses sinusoidal PE. Modern models often use alternatives:

  • RoPE: Llama, PaLM, GPT-NeoX
  • ALiBi: BLOOM, MPT
  • Learned PE: BERT, GPT-2 (see the sketch after this list)
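
For contrast with the fixed sinusoidal table, a learned position encoding (the variant named above for BERT and GPT-2) is simply a trainable embedding table. The sketch below is a minimal illustration with arbitrary sizes, not the actual BERT or GPT-2 code.

```python
# Minimal sketch of a learned position encoding; unlike the sinusoidal table,
# it is trainable and limited to a maximum length chosen in advance.
import torch
import torch.nn as nn

max_len, d_model = 512, 256
learned_pe = nn.Embedding(max_len, d_model)      # one trainable vector per position

positions = torch.arange(64).unsqueeze(0)        # positions 0..63 for one sequence
x_pos = learned_pe(positions)                    # shape (1, 64, d_model)
```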