Core Idea

ALiBi replaces complex position embeddings with an elegant trick: each attention head gets a linear bias that penalizes attention to distant tokens. The formula is simple: bias(i, j) = -m × |i - j|. Each head has a different slope m, so some heads specialize in local dependencies and others in global ones.
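As a minimal sketch of that formula (sequence length and slope here are illustrative, not from a specific model):

```python
import numpy as np

def alibi_bias(seq_len, slope):
    """bias[i, j] = -slope * |i - j| for one attention head."""
    pos = np.arange(seq_len)
    return -slope * np.abs(pos[:, None] - pos[None, :])

bias = alibi_bias(5, slope=1/8)
# bias[0, 4] == -0.5: a token 4 positions away is penalized by 4 * (1/8)
```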

1. 8 Heads = 8 Different Ranges
2. Focus Token: How Strongly Are Distant Tokens Penalized?
Head 0 (m=1/8) – Short Range
Head 7 (m=1/1024) – Long Range
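Only the first and last heads are listed above; a plausible slope schedule connecting them halves the slope from head to head (an assumption inferred from the two endpoints, consistent with the geometric slope sequence used in the ALiBi paper):

```python
def head_slopes(n_heads=8, first=1/8):
    """Geometric slope schedule: each head's slope is half the previous one."""
    # Assumed halving pattern, inferred from the endpoints 1/8 (head 0) and 1/1024 (head 7).
    return [first / 2**h for h in range(n_heads)]

slopes = head_slopes()  # [1/8, 1/16, 1/32, 1/64, 1/128, 1/256, 1/512, 1/1024]
```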
3. Attention Computation: Step by Step
How ALiBi Modifies Attention Scores
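The steps can be sketched for a single head as follows (a hedged sketch, not a reference implementation; shapes and the random inputs are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def alibi_attention(q, k, v, slope):
    """One head of causal attention with an ALiBi bias; q, k, v: (seq_len, d)."""
    seq_len, d = q.shape
    scores = q @ k.T / np.sqrt(d)                 # 1. scaled dot-product scores
    i = np.arange(seq_len)
    dist = i[:, None] - i[None, :]                # i - j (positive for past tokens)
    scores = scores - slope * np.abs(dist)        # 2. add linear ALiBi bias
    scores = np.where(dist < 0, -np.inf, scores)  # 3. causal mask: future (j > i) -> -inf
    return softmax(scores) @ v                    # 4. softmax, then weighted sum of values

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 2)) for _ in range(3))
out = alibi_attention(q, k, v, slope=1/8)  # token 0 can only attend to itself
```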
4. Extrapolation: Train Short, Infer Long
[Interactive demo: a sequence-length slider contrasts the training length (512 tokens) with longer inference lengths (1K tokens and beyond). Does it still work? Yes: ALiBi scales linearly to any sequence length.]
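Because the bias is a fixed linear function of distance, nothing has to be retrained for longer sequences; the same formula simply produces a larger matrix (the 512/1024 lengths below mirror the demo and are illustrative):

```python
import numpy as np

def alibi_bias(seq_len, slope):
    pos = np.arange(seq_len)
    return -slope * np.abs(pos[:, None] - pos[None, :])

train_bias = alibi_bias(512, slope=1/8)    # length seen during training
infer_bias = alibi_bias(1024, slope=1/8)   # longer inference length, same formula
# The top-left 512x512 block of the longer matrix is identical to the
# training-length bias; positions beyond 512 just continue the linear ramp.
```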
5. ALiBi vs. RoPE: The Comparison
📐 ALiBi
Simple: Only adds constants (no rotation)
30% faster than RoPE in computation
Extrapolates: Training on 1K, inference up to 8K+
Used by: BLOOM (176B), MPT (7B-65B)
🔄 RoPE
⚙️ Complex: Rotates Q/K vectors in the complex plane
⚙️ More computation through sine/cosine operations
⚠️ Needs interpolation for longer sequences
⚙️ Used by: LLaMA, Mistral, GPT-NeoX
Why Different Slopes?
Steep slopes (m=1/8): a strong distance penalty gives a local preference, suited to bigrams and phrases. Shallow slopes (m=1/1024): a weak distance penalty enables long-range dependencies.
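The contrast shows up directly in the pre-softmax attenuation factor e^(bias); comparing the two slopes above at distance 64 (an illustrative distance):

```python
import math

distance = 64
steep = math.exp(-(1 / 8) * distance)       # ~3.4e-4: distant tokens effectively invisible
shallow = math.exp(-(1 / 1024) * distance)  # ~0.94: distant tokens barely penalized
```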
Causal Masking
The upper triangle (j > i) is set to -∞, just as in standard causal attention. The ALiBi bias applies only to the lower triangle (j ≤ i, i.e., the current and past tokens).
Production-Ready
BLOOM (176B parameters) and the MPT models use ALiBi by default. The implementation is simple, and length generalization is better than with sinusoidal position embeddings.