How Transformers understand token order: sine and cosine waves of different frequencies give each position a unique encoding.
Positional Encoding gives Transformers a sense of order. Since Self-Attention is position-agnostic, we must explicitly encode each token's position into the input vector.
After Tokenization (Step 1) and Embedding (Step 2), we have meaningful vectors, but no position information. Here we add that positional information before the vectors flow into Self-Attention (Step 4).
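As a concrete illustration, here is a minimal NumPy sketch of the classic sinusoidal encoding from the original Transformer paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the function name and the random embedding matrix are placeholders of ours, not part of any particular codebase:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)),  PE(pos, 2i+1) = cos(...)."""
    positions = np.arange(seq_len)[:, None]                        # (seq_len, 1)
    freqs = 1.0 / 10000.0 ** (np.arange(0, d_model, 2) / d_model)  # one frequency per dimension pair
    angles = positions * freqs[None, :]                            # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions: cosine
    return pe

# The encoding is simply added to the token embeddings before Self-Attention.
embeddings = np.random.randn(6, 512)                  # placeholder for the output of Step 2
model_input = embeddings + sinusoidal_positional_encoding(6, 512)
```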
Without Positional Encoding, "The dog bites the man" would be processed the same way as "The man bites the dog". Many modern models use RoPE (Rotary Position Embeddings) instead of sinusoidal encodings, as RoPE extrapolates better to longer sequences.
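For comparison, a rough sketch of the RoPE idea: instead of adding a vector to the embedding, each pair of query/key dimensions is rotated by an angle proportional to the position. This is a simplified NumPy illustration of the interleaved variant, not the implementation of any particular library:

```python
import numpy as np

def rope(x: np.ndarray, positions: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate each dimension pair (2i, 2i+1) of a query/key vector by the
    angle position * base^(-2i/d) -- the core idea of RoPE."""
    seq_len, d = x.shape
    theta = base ** (-np.arange(0, d, 2) / d)            # (d/2,) frequencies
    angles = positions[:, None] * theta[None, :]         # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin   # 2D rotation per pair
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
    return out

# Quick check of the key property: with the same q/k vector at every position,
# the attention score depends only on the relative offset between positions.
q = np.tile(np.random.randn(1, 64), (8, 1))
scores = rope(q, np.arange(8)) @ rope(q, np.arange(8)).T
print(np.allclose(np.diag(scores, 1), scores[0, 1]))     # True: all offset-1 scores are equal
```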
Unique Positions: Each position receives a unique vector. The combination of different frequencies works like a "binary counter": low dimensions oscillate quickly (like the ones place), high dimensions slowly (like the thousands place).
Relative Positions: For any fixed distance k, there exists a linear transformation, independent of pos, that maps PE(pos) to PE(pos+k). This allows the model to learn to use relative distances (see the numerical check after this list).
Generalization: The functions are defined for arbitrary positions – theoretically even for longer sequences than seen during training.
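The relative-position property can be verified numerically: a block-diagonal matrix of 2x2 rotations, which depends only on the distance k (k = 5 below, picked arbitrarily), carries PE(pos) to PE(pos+k) for every pos. A small NumPy sketch:

```python
import numpy as np

# Rebuild the sinusoidal table from above and show that a single block-diagonal
# rotation matrix T_k maps PE(pos) to PE(pos + k) for *every* pos.
d_model, seq_len = 64, 128
omega = 1.0 / 10000.0 ** (np.arange(0, d_model, 2) / d_model)
angles = np.arange(seq_len)[:, None] * omega[None, :]
pe = np.zeros((seq_len, d_model))
pe[:, 0::2], pe[:, 1::2] = np.sin(angles), np.cos(angles)

def shift_matrix(k: int) -> np.ndarray:
    """One 2x2 rotation by k*omega_i per frequency; depends only on the distance k."""
    T = np.zeros((d_model, d_model))
    for i, w in enumerate(omega):
        c, s = np.cos(k * w), np.sin(k * w)
        T[2*i:2*i+2, 2*i:2*i+2] = [[c, s], [-s, c]]   # acts on the (sin, cos) pair i
    return T

T5 = shift_matrix(5)
print(np.allclose(pe[10 + 5], T5 @ pe[10]))   # True
print(np.allclose(pe[77 + 5], T5 @ pe[77]))   # True: same matrix, different pos
```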