A special learnable vector prepended to the input sequence. The final hidden state at this position is used as the aggregate representation of the entire image or text for classification tasks.
Introduced in BERT as the [CLS] token, adopted by ViT as the class token.
Common pattern in encoder-only models.
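
The mechanism can be illustrated with a minimal pure-Python sketch. It is not how BERT or ViT are implemented: `toy_encoder` is a hypothetical stand-in that mixes each position with the sequence mean in place of real self-attention, and the names `prepend_cls`, `classify`, and `head_weights` are assumptions made for illustration. In a real model the class vector and the head weights would be learned; here they are fixed by hand.

```python
def prepend_cls(tokens, cls_vector):
    """Return a new sequence with the class vector at position 0."""
    return [list(cls_vector)] + [list(t) for t in tokens]


def toy_encoder(sequence):
    """Hypothetical stand-in for a transformer encoder.

    Each output is the average of its own input and the sequence mean,
    so every position (including position 0) sees information from all others.
    """
    dim = len(sequence[0])
    n = len(sequence)
    mean = [sum(vec[d] for vec in sequence) / n for d in range(dim)]
    return [[0.5 * vec[d] + 0.5 * mean[d] for d in range(dim)] for vec in sequence]


def classify(tokens, cls_vector, head_weights):
    """Prepend the class vector, encode, and read logits from position 0 only."""
    sequence = prepend_cls(tokens, cls_vector)
    hidden = toy_encoder(sequence)
    cls_state = hidden[0]  # final state of the class token
    # Linear classification head applied to the class token's state.
    return [sum(w * h for w, h in zip(row, cls_state)) for row in head_weights]


# Three 2-dimensional input embeddings (word pieces or image patches).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cls_vector = [0.0, 0.0]  # would be a learned parameter in practice
head_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]  # 3 classes

logits = classify(tokens, cls_vector, head_weights)
print(logits)
```

The sequence grows by one position, and only the state at position 0 feeds the classification head; the states of the other positions are discarded for this task.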