A special, learnable vector prepended to the input sequence in Transformer models. Because this token can attend to every other token, its final hidden state serves as an aggregate representation of the entire sequence (e.g., for sentence classification).
Introduced with BERT (2018), where it is written [CLS], and later adopted by Vision Transformers (ViT) as the class token.
A standard architectural pattern in encoder-only models for tasks like sentiment analysis or image classification.
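The pattern above can be sketched in a few lines of PyTorch. This is a minimal illustration, not BERT or ViT themselves: a learnable vector is prepended to each batch of token embeddings, the sequence passes through a Transformer encoder, and a linear head classifies from the first output position. All dimensions and the class name `CLSClassifier` are arbitrary choices for the example.

```python
import torch
import torch.nn as nn

class CLSClassifier(nn.Module):
    """Sketch of classification via a learnable [CLS]-style token."""

    def __init__(self, d_model=32, n_heads=4, n_layers=2, n_classes=3):
        super().__init__()
        # One learnable vector, shared across all inputs, shaped (1, 1, d_model)
        # so it can be broadcast to any batch size.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):  # x: (batch, seq_len, d_model) token embeddings
        batch = x.size(0)
        cls = self.cls_token.expand(batch, -1, -1)      # (batch, 1, d_model)
        h = self.encoder(torch.cat([cls, x], dim=1))    # prepend, then encode
        return self.head(h[:, 0])                        # final [CLS] state

model = CLSClassifier()
logits = model(torch.randn(8, 16, 32))  # 8 sequences of 16 embeddings
print(logits.shape)                     # one logit vector per sequence
```

Because self-attention lets position 0 attend to all other positions in every layer, the extra token accumulates sequence-wide information without any pooling step.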