An attention mechanism is a technique used in neural networks that allows the model to focus on specific parts of an input sequence when making predictions. Instead of treating all parts of the input equally, the model learns to assign an 'attention weight' to each part, giving more importance to the most relevant information. This is often described using a 'query-key-value' model: a 'query' (the current focus) is scored against a set of 'keys' (one per part of the input) to determine each part's relevance, the scores are normalized into weights (typically with a softmax), and the corresponding 'values' are combined as a weighted sum to produce the output.
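The query-key-value computation described above can be sketched in a few lines of NumPy. This is a minimal illustration of scaled dot-product attention (the variant used in Transformers), not production code; the function name and the toy shapes are chosen here for clarity.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Return softmax(Q K^T / sqrt(d_k)) V and the attention weights."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax -> attention weights
    return weights @ V, weights                           # weighted sum of values

# Toy example: one query attending over three input positions, dimension 4.
rng = np.random.default_rng(0)
Q = rng.normal(size=(1, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out, w = scaled_dot_product_attention(Q, K, V)
```

The weights in each row are non-negative and sum to 1, so the output is a convex combination of the value vectors, with the most query-relevant positions contributing most.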
Attention mechanisms were first introduced in the context of machine translation in a 2014 paper by Bahdanau et al. to help recurrent neural networks (RNNs) better handle long sentences. However, their full potential was realized with the introduction of the Transformer architecture in the 2017 paper 'Attention Is All You Need' by Vaswani et al., which relied entirely on self-attention, a form of attention that relates different positions of a single sequence.
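Self-attention differs from the original encoder-decoder attention only in where the queries, keys, and values come from: all three are projections of the same sequence, so every position attends to every other position. A minimal sketch, assuming a single head and random (rather than learned) projection matrices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4                               # embedding dimension (illustrative)
X = rng.normal(size=(5, d))         # one sequence of 5 token embeddings

# Projection matrices; in a real model these are learned parameters.
W_q = rng.normal(size=(d, d))
W_k = rng.normal(size=(d, d))
W_v = rng.normal(size=(d, d))

# Queries, keys, and values all derive from the SAME sequence X.
Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T / np.sqrt(d)                       # (5, 5): every position vs every position
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
output = weights @ V                                # each position mixes information from all positions
```

The resulting 5x5 weight matrix is what "relates different positions of a single sequence": row i tells you how much position i draws on every other position when computing its new representation.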
The attention mechanism, and particularly self-attention, has become the cornerstone of modern AI, especially in natural language processing. It is the core component of the Transformer architecture, which powers virtually all state-of-the-art large language models (LLMs) like GPT, BERT, and LLaMA. The concept has also been successfully applied to other domains, including computer vision, with the development of Vision Transformers (ViTs).