An efficient method for handling position in Transformer models without learned positional embeddings. ALiBi (Attention with Linear Biases) adds a static, non-learned bias to the attention scores that penalizes token interactions in proportion to their distance: each head subtracts a fixed slope times the query-key distance, so the further apart two tokens are, the less they attend to each other. This simple inductive bias allows models trained on short sequences to extrapolate effectively to much longer sequences at inference time.
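The distance penalty described above can be sketched as follows. This is a minimal NumPy illustration, not the implementation used in any particular model; the helper names (`alibi_slopes`, `alibi_bias`) are made up for this example, and it assumes the number of heads is a power of two, with per-head slopes forming the geometric sequence from the paper (e.g. 2^-1 … 2^-8 for 8 heads).

```python
import numpy as np

def alibi_slopes(n_heads):
    # Geometric sequence of head-specific slopes, e.g. for 8 heads:
    # 1/2, 1/4, ..., 1/256 (assumes n_heads is a power of two).
    start = 2 ** (-8.0 / n_heads)
    return np.array([start ** (i + 1) for i in range(n_heads)])

def alibi_bias(n_heads, seq_len):
    # distance[i, j] = i - j: how many positions behind query i key j sits.
    positions = np.arange(seq_len)
    distance = positions[:, None] - positions[None, :]
    slopes = alibi_slopes(n_heads)
    # bias[h, i, j] = -m_h * (i - j), added to the attention logits
    # before the softmax; causal masking is applied separately.
    return -slopes[:, None, None] * distance[None, :, :]

# Example: 4 heads over a sequence of 5 tokens.
bias = alibi_bias(4, 5)
```

Because the bias depends only on relative distance and fixed slopes, no parameters are learned and the same formula applies unchanged to sequence lengths never seen during training.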
Introduced in the paper 'Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation' (Press et al., 2021).
Used in models such as MPT (MosaicML Pretrained Transformer) and BLOOM to support context windows at inference far longer than those seen during training.