PE(pos, 2i)   = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
Even dimensions (2i) use sine, odd dimensions (2i+1) use cosine.
Different dimension pairs have different frequencies, set by the denominator 10000^(2i/d): small i oscillates quickly, large i slowly.
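
A minimal NumPy sketch of these two formulas (the helper name sinusoidal_pe is my own; only the formula itself comes from the text above):

```python
import numpy as np

def sinusoidal_pe(num_positions, d_model):
    """Sinusoidal position encodings as in the original Transformer formula."""
    positions = np.arange(num_positions)[:, None]       # shape (pos, 1)
    two_i = np.arange(0, d_model, 2)[None, :]           # the 2i values: 0, 2, ..., d-2
    angles = positions / 10000 ** (two_i / d_model)     # pos / 10000^(2i/d)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions (2i): sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions (2i+1): cosine
    return pe

pe = sinusoidal_pe(64, 64)   # matches the 64 x 64 matrix visualized below
```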
[Interactive visualization: sine waves for selected dimensions (d₀, d₂, d₄, d₈, d₁₆, d₃₂), the encoding vector at a chosen position, and the full position-encoding matrix of 64 positions × 64 dimensions. Each row is a unique position vector; frequency decreases from dimension 0 (high) to dimension 63 (low), with values between −1 and +1.]
💡 Why Sine and Cosine?

Unique Positions: Each position receives a unique vector. The combination of different frequencies works like a "binary counter" – low dimensions oscillate fast (ones place), high ones slowly (thousands place).
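
Reusing the hypothetical sinusoidal_pe helper sketched above, the counter analogy is easy to see at the two ends of the vector:

```python
pe = sinusoidal_pe(8, 64)
print(np.round(pe[:, 0], 2))    # dimension 0: flips sign within a few positions
print(np.round(pe[:, 62], 5))   # dimension 62: barely moves across 8 positions
```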

Relative Positions: For any fixed distance k, there exists a linear transformation that maps PE(pos) to PE(pos+k). This allows the model to learn to use relative distances.
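
A quick numerical check of this property for a single sine/cosine pair (pos, k, and i are arbitrary example values): the rotation matrix depends only on the offset k, never on the absolute position.

```python
import numpy as np

pos, k, i, d = 10.0, 5.0, 4, 64
omega = 1.0 / 10000 ** (2 * i / d)   # frequency of dimension pair i

pe_pair = lambda p: np.array([np.sin(omega * p), np.cos(omega * p)])

# 2x2 rotation built from the offset k alone:
R_k = np.array([[ np.cos(omega * k), np.sin(omega * k)],
                [-np.sin(omega * k), np.cos(omega * k)]])

assert np.allclose(R_k @ pe_pair(pos), pe_pair(pos + k))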

Generalization: The functions are defined for arbitrary positions – theoretically even for longer sequences than seen during training.
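
Because nothing is learned, the same hypothetical helper from above evaluates at any position, far past a 64-token training window:

```python
pe_long = sinusoidal_pe(100_000, 64)   # positions 0..99_999, no lookup table to outgrow
```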

Sinusoidal (Original, 2017)
  • No trainable parameters
  • Theoretically unlimited length
  • Fixed, deterministic values
  • Used in: Original Transformer
RoPE (Rotary, modern)
  • Rotation instead of addition
  • Better extrapolation
  • Encodes relative positions naturally
  • Used in: Llama, Mistral, PaLM
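
A minimal sketch of the rotary idea, applied to a single query or key vector (interleaved pairing here is illustrative; it is not the exact Llama/Mistral layout):

```python
import numpy as np

def rope(x, pos, base=10000):
    """Rotate each (even, odd) feature pair of x by a position-dependent angle."""
    d = x.shape[-1]
    theta = pos / base ** (np.arange(0, d, 2) / d)   # one angle per pair
    cos, sin = np.cos(theta), np.sin(theta)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

# The dot product of rotated queries/keys depends only on the offset,
# which is what makes the encoding relative:
q, k = np.random.randn(2, 64)
assert np.allclose(rope(q, 7) @ rope(k, 4), rope(q, 10) @ rope(k, 7))
```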
ALiBi (Linear Bias, modern)
  • No separate encoding
  • Bias directly on attention scores
  • Zero-shot length extrapolation
  • Used in: BLOOM, MPT
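
And a sketch of the ALiBi idea: no position vectors at all, just a per-head linear distance penalty on the attention scores (the slope schedule 2^(-8i/n) follows the ALiBi paper; the helper name is my own):

```python
import numpy as np

def alibi_bias(num_heads, seq_len):
    """Per-head distance penalty added to raw attention scores before softmax."""
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    # dist[i, j] = j - i: zero on the diagonal, increasingly negative
    # the further key j lies behind query i
    dist = np.arange(seq_len)[None, :] - np.arange(seq_len)[:, None]
    return slopes[:, None, None] * np.minimum(dist, 0)   # shape (heads, q, k)

# usage sketch: scores = q @ k.T / np.sqrt(d) + alibi_bias(8, seq_len)[head]
```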