How Transformers understand token order: sine and cosine waves of different frequencies give each position a unique encoding.
Positional Encoding gives Transformers a sense of order. Since Self-Attention is position-agnostic, we must explicitly encode each token's position into the input vector.
After Tokenization (Step 1) and Embedding (Step 2), we have meaningful vectors, but no position information. Here we add that positional information before the vectors flow into Self-Attention (Step 4).
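As a concrete illustration, here is a minimal NumPy sketch of the classic sinusoidal encoding from the original Transformer paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the function name and the random embedding matrix are placeholders of ours, not part of any particular codebase:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)),  PE(pos, 2i+1) = cos(...)."""
    positions = np.arange(seq_len)[:, None]                        # (seq_len, 1)
    freqs = 1.0 / 10000.0 ** (np.arange(0, d_model, 2) / d_model)  # one frequency per dimension pair
    angles = positions * freqs[None, :]                            # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions: cosine
    return pe

# The encoding is simply added to the token embeddings before Self-Attention.
embeddings = np.random.randn(6, 512)                  # placeholder for the output of Step 2
model_input = embeddings + sinusoidal_positional_encoding(6, 512)
```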
Without Positional Encoding, "The dog bites the man" would be processed the same way as "The man bites the dog". Many modern models use RoPE (Rotary Position Embeddings) instead of sinusoidal encodings, as RoPE extrapolates better to longer sequences.
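For comparison, a rough sketch of the RoPE idea: instead of adding a vector to the embedding, each pair of query/key dimensions is rotated by an angle proportional to the position. This is a simplified NumPy illustration of the interleaved variant, not the implementation of any particular library:

```python
import numpy as np

def rope(x: np.ndarray, positions: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate each dimension pair (2i, 2i+1) of a query/key vector by the
    angle position * base^(-2i/d) -- the core idea of RoPE."""
    seq_len, d = x.shape
    theta = base ** (-np.arange(0, d, 2) / d)            # (d/2,) frequencies
    angles = positions[:, None] * theta[None, :]         # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin   # 2D rotation per pair
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
    return out

# Quick check of the key property: with the same q/k vector at every position,
# the attention score depends only on the relative offset between positions.
q = np.tile(np.random.randn(1, 64), (8, 1))
scores = rope(q, np.arange(8)) @ rope(q, np.arange(8)).T
print(np.allclose(np.diag(scores, 1), scores[0, 1]))     # True: all offset-1 scores are equal
```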
Unique Positions: Each position receives a unique vector. The combination of different frequencies works like a "binary counter": low dimensions oscillate quickly (like the ones place), high dimensions slowly (like the thousands place).
Relative Positions: For any fixed distance k, there exists a linear transformation, independent of pos, that maps PE(pos) to PE(pos+k). This allows the model to learn to use relative distances (see the numerical check after this list).
Generalization: The functions are defined for arbitrary positions – theoretically even for longer sequences than seen during training.
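The relative-position property can be verified numerically: a block-diagonal matrix of 2x2 rotations, which depends only on the distance k (k = 5 below, picked arbitrarily), carries PE(pos) to PE(pos+k) for every pos. A small NumPy sketch:

```python
import numpy as np

# Rebuild the sinusoidal table from above and show that a single block-diagonal
# rotation matrix T_k maps PE(pos) to PE(pos + k) for *every* pos.
d_model, seq_len = 64, 128
omega = 1.0 / 10000.0 ** (np.arange(0, d_model, 2) / d_model)
angles = np.arange(seq_len)[:, None] * omega[None, :]
pe = np.zeros((seq_len, d_model))
pe[:, 0::2], pe[:, 1::2] = np.sin(angles), np.cos(angles)

def shift_matrix(k: int) -> np.ndarray:
    """One 2x2 rotation by k*omega_i per frequency; depends only on the distance k."""
    T = np.zeros((d_model, d_model))
    for i, w in enumerate(omega):
        c, s = np.cos(k * w), np.sin(k * w)
        T[2*i:2*i+2, 2*i:2*i+2] = [[c, s], [-s, c]]   # acts on the (sin, cos) pair i
    return T

T5 = shift_matrix(5)
print(np.allclose(pe[10 + 5], T5 @ pe[10]))   # True
print(np.allclose(pe[77 + 5], T5 @ pe[77]))   # True: same matrix, different pos
```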