Sinusoidal Positional Encoding

Each position receives a unique vector built from sine and cosine functions of different frequencies. Low dimensions oscillate quickly (high frequency), while high dimensions oscillate slowly (low frequency). This lets the model distinguish positions at both fine (local) and coarse (global) scales.

Fig. 1 | Position Encoding Matrix (Position × Dimension). Each row corresponds to a token position, each column to an embedding dimension; even dimensions use sine, odd dimensions cosine. Color scale: dark blue (-1.0), yellow (0.0), dark red (+1.0). Note the periodic stripes: low dimensions (left) oscillate quickly, high dimensions (right) slowly.

Formula

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

pos: Position of the token (0 to n-1)
i: Index of the sine/cosine dimension pair (0 to d_model/2 - 1)
d_model: Embedding dimension (e.g., 512)
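
As a concrete illustration, here is a minimal NumPy sketch of the formula; the function name sinusoidal_position_encoding and the example sizes (64 positions, 256 dimensions) are arbitrary choices for this sketch.

```python
# Minimal NumPy sketch of the formula above; names and sizes are illustrative.
import numpy as np

def sinusoidal_position_encoding(n: int, d_model: int) -> np.ndarray:
    """Return an (n, d_model) matrix of sinusoidal position encodings."""
    pos = np.arange(n)[:, np.newaxis]                  # positions 0..n-1, shape (n, 1)
    i = np.arange(d_model // 2)[np.newaxis, :]         # dimension-pair index, shape (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)  # pos / 10000^(2i/d_model)
    pe = np.zeros((n, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions: cosine
    return pe

pe = sinusoidal_position_encoding(64, 256)
print(pe.shape)    # (64, 256)
print(pe[0, :4])   # position 0 -> [0. 1. 0. 1.]
```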

Why this formula?

  • Unique encoding for each position
  • Different frequencies for different dimensions
  • Low dimensions: High frequency (fast change)
  • High dimensions: Low frequency (slow change)
  • Enables generalization to longer sequences (checked in the sketch after this list)
  • No trainable parameters
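
The listed properties can be checked directly. The sketch below assumes the sinusoidal_position_encoding helper from the previous sketch; the sequence lengths are arbitrary examples.

```python
# Small sanity check of the properties listed above; lengths are arbitrary examples.
import numpy as np

pe_short = sinusoidal_position_encoding(64, 256)     # e.g. the length seen during training
pe_long  = sinusoidal_position_encoding(4096, 256)   # a much longer sequence

# Unique encoding: no two positions share the same vector.
assert np.unique(pe_short.round(6), axis=0).shape[0] == 64

# Generalization: the first 64 rows are identical, and longer sequences need
# no retraining because the encoding has no trainable parameters.
assert np.allclose(pe_short, pe_long[:64])
```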

Patterns in the Heatmap

Stripes within each column: Each dimension (column) oscillates periodically as the position increases, reflecting the sine/cosine functions. Low dimensions (left) produce narrow stripes (high frequency), high dimensions (right) wide stripes (low frequency).

Variation across rows: Each row is one position's unique fingerprint across all dimensions; comparing rows shows how the encoding changes from position to position.
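
These patterns can be reproduced with a quick plot. The sketch below again assumes the sinusoidal_position_encoding helper and matplotlib; the colormap is just one choice that roughly matches the color scale of Fig. 1.

```python
# Reproduce the heatmap of Fig. 1; assumes matplotlib and the helper defined earlier.
import matplotlib.pyplot as plt

pe = sinusoidal_position_encoding(64, 256)

plt.figure(figsize=(8, 4))
plt.imshow(pe, aspect="auto", cmap="RdYlBu_r", vmin=-1.0, vmax=1.0)  # blue -1, yellow 0, red +1
plt.xlabel("Embedding dimension")
plt.ylabel("Token position")
plt.colorbar(label="PE value")
plt.title("Sinusoidal position encoding")
plt.tight_layout()
plt.show()
```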

Usage in Practice

Position Encoding is added to token embeddings:

Input = Token_Embedding + Position_Encoding
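
A minimal PyTorch sketch of this addition, reusing the sinusoidal_position_encoding helper from above; the vocabulary size, sequence length, and embedding size are arbitrary example values.

```python
# Sketch of adding position encodings to token embeddings (PyTorch); sizes are examples.
import torch
import torch.nn as nn

vocab_size, seq_len, d_model = 10_000, 64, 256
token_ids = torch.randint(0, vocab_size, (1, seq_len))           # one example batch

token_embedding = nn.Embedding(vocab_size, d_model)
pe = torch.from_numpy(sinusoidal_position_encoding(seq_len, d_model)).float()

# Input = Token_Embedding + Position_Encoding (pe broadcasts over the batch dimension)
x = token_embedding(token_ids) + pe                               # shape (1, seq_len, d_model)
```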

The original Transformer (Vaswani et al., 2017) uses sinusoidal PE. Modern models often use alternatives:

  • RoPE: Llama, PaLM, GPT-NeoX
  • ALiBi: BLOOM, MPT
  • Learned PE: BERT, GPT-2 (see the sketch after this list)
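
For contrast with the fixed sinusoidal table, a learned position encoding (the variant named above for BERT and GPT-2) is simply a trainable embedding table. The sketch below is a minimal illustration with arbitrary sizes, not the actual BERT or GPT-2 code.

```python
# Minimal sketch of a learned position encoding; unlike the sinusoidal table,
# it is trainable and limited to a maximum length chosen in advance.
import torch
import torch.nn as nn

max_len, d_model = 512, 256
learned_pe = nn.Embedding(max_len, d_model)      # one trainable vector per position

positions = torch.arange(64).unsqueeze(0)        # positions 0..63 for one sequence
x_pos = learned_pe(positions)                    # shape (1, 64, d_model)
```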