Different attention heads learn different ranges, from local bigrams to sentence-wide dependencies, using nothing more than simple linear bias terms.
ALiBi solves the extrapolation problem with an elegant trick: instead of learning positions, it penalizes distance. The further away a token is, the stronger the negative bias. Because each head has a different slope, the model automatically develops local and global experts.
Position encoding determines how far a model can "see". ALiBi is one of two widely used modern approaches (alongside RoPE) to extending context beyond the training length.
BLOOM (176B parameters) uses ALiBi and can extrapolate at inference time well beyond the 2,048-token context it was trained on. The principle is so simple that the bias can be added in a single line of code.
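To illustrate that "single line": a minimal, dependency-free sketch in which the bias matrix is built once and then added to the raw attention scores before the softmax. The function name and the example slope `m = 0.25` are illustrative choices, not from the source.

```python
def alibi_bias(seq_len, m):
    """Linear distance penalty: bias[i][j] = -m * |i - j| (a minimal sketch)."""
    return [[-m * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]

bias = alibi_bias(6, 0.25)
# Inside an attention layer, the "one line" is simply:
#     scores = scores + bias   # added before the softmax
```

The penalty is zero on the diagonal (a token attending to itself) and grows linearly with distance, so nearby tokens keep their original scores almost untouched.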
ALiBi replaces learned position embeddings with an elegant trick: each attention head gets a linear bias
that penalizes distant tokens. The formula is simple: bias(i,j) = -m × |i-j|.
Each head has a different slope m, so some heads specialize in local and others in global dependencies.
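The per-head slopes can be sketched as follows. The ALiBi paper sets them as a geometric sequence, m_h = 2^(-8h/n) for head h of n heads; the function names here are illustrative.

```python
def alibi_slopes(n_heads):
    """Geometric slopes as in the ALiBi paper: m_h = 2**(-8*h/n) for h = 1..n."""
    return [2 ** (-8 * (h + 1) / n_heads) for h in range(n_heads)]

def alibi_biases(seq_len, n_heads):
    """One bias matrix per head: biases[h][i][j] = -m_h * |i - j|."""
    return [
        [[-m * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]
        for m in alibi_slopes(n_heads)
    ]

slopes = alibi_slopes(8)  # steepest head 1/2, shallowest 1/256
```

The steep-slope heads effectively see only a few neighboring tokens (local experts), while the shallow-slope heads are barely penalized and can attend across the whole sequence (global experts).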