Core Idea

ALiBi replaces complex position embeddings with an elegant trick: each attention head gets a linear bias that penalizes attention to distant tokens. The formula is simple: bias(i, j) = -m × |i - j|. Each head has a different slope m, so some heads specialize in local dependencies and others in global ones.
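As a minimal sketch of that formula (sequence length and slope here are illustrative, not from a specific model):

```python
import numpy as np

def alibi_bias(seq_len, slope):
    """bias[i, j] = -slope * |i - j| for one attention head."""
    pos = np.arange(seq_len)
    return -slope * np.abs(pos[:, None] - pos[None, :])

bias = alibi_bias(5, slope=1/8)
# bias[0, 4] == -0.5: a token 4 positions away is penalized by 4 * (1/8)
```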

1. 8 Heads = 8 Different Ranges
2. Focus Token: How Strongly Are Distant Tokens Penalized?
Head 0 (m=1/8) – Short Range
Head 7 (m=1/1024) – Long Range
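Only the first and last heads are listed above; a plausible slope schedule connecting them halves the slope from head to head (an assumption inferred from the two endpoints, consistent with the geometric slope sequence used in the ALiBi paper):

```python
def head_slopes(n_heads=8, first=1/8):
    """Geometric slope schedule: each head's slope is half the previous one."""
    # Assumed halving pattern, inferred from the endpoints 1/8 (head 0) and 1/1024 (head 7).
    return [first / 2**h for h in range(n_heads)]

slopes = head_slopes()  # [1/8, 1/16, 1/32, 1/64, 1/128, 1/256, 1/512, 1/1024]
```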
3. Attention Computation: Step by Step
How ALiBi Modifies Attention Scores
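The steps can be sketched for a single head as follows (a hedged sketch, not a reference implementation; shapes and the random inputs are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def alibi_attention(q, k, v, slope):
    """One head of causal attention with an ALiBi bias; q, k, v: (seq_len, d)."""
    seq_len, d = q.shape
    scores = q @ k.T / np.sqrt(d)                 # 1. scaled dot-product scores
    i = np.arange(seq_len)
    dist = i[:, None] - i[None, :]                # i - j (positive for past tokens)
    scores = scores - slope * np.abs(dist)        # 2. add linear ALiBi bias
    scores = np.where(dist < 0, -np.inf, scores)  # 3. causal mask: future (j > i) -> -inf
    return softmax(scores) @ v                    # 4. softmax, then weighted sum of values

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 2)) for _ in range(3))
out = alibi_attention(q, k, v, slope=1/8)  # token 0 can only attend to itself
```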
4. Extrapolation: Train Short, Infer Long
[Interactive demo: a sequence-length slider contrasts the training length (512 tokens) with longer inference lengths (1K tokens and beyond). Does it still work? Yes: ALiBi scales linearly to any sequence length.]
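Because the bias is a fixed linear function of distance, nothing has to be retrained for longer sequences; the same formula simply produces a larger matrix (the 512/1024 lengths below mirror the demo and are illustrative):

```python
import numpy as np

def alibi_bias(seq_len, slope):
    pos = np.arange(seq_len)
    return -slope * np.abs(pos[:, None] - pos[None, :])

train_bias = alibi_bias(512, slope=1/8)    # length seen during training
infer_bias = alibi_bias(1024, slope=1/8)   # longer inference length, same formula
# The top-left 512x512 block of the longer matrix is identical to the
# training-length bias; positions beyond 512 just continue the linear ramp.
```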
5. ALiBi vs. RoPE: The Comparison
📐 ALiBi
Simple: Only adds constants (no rotation)
30% faster than RoPE in computation
Extrapolates: Training on 1K, inference up to 8K+
Used by: BLOOM (176B), MPT (7B-65B)
🔄 RoPE
⚙️ Complex: Rotates Q/K vectors in the complex plane
⚙️ More computation through sine/cosine operations
⚠️ Needs interpolation for longer sequences
⚙️ Used by: LLaMA, Mistral, GPT-NeoX
Why Different Slopes?
Steep slopes (m=1/8): a strong distance penalty gives a local preference, suited to bigrams and phrases. Shallow slopes (m=1/1024): a weak distance penalty enables long-range dependencies.
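The contrast shows up directly in the pre-softmax attenuation factor e^(bias); comparing the two slopes above at distance 64 (an illustrative distance):

```python
import math

distance = 64
steep = math.exp(-(1 / 8) * distance)       # ~3.4e-4: distant tokens effectively invisible
shallow = math.exp(-(1 / 1024) * distance)  # ~0.94: distant tokens barely penalized
```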
Causal Masking
The upper triangle (j > i) is set to -∞, just as in standard causal attention. The ALiBi bias applies only to the lower triangle (j ≤ i, i.e., the current and past tokens).
Production-Ready
BLOOM (176B parameters) and the MPT models use ALiBi by default. The implementation is simple, and length generalization is better than with sinusoidal position embeddings.