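An architecture that applies the Transformer directly to images by splitting each image into fixed-size patches and treating them as tokens. Because self-attention lets every patch attend to every other patch from the first layer, it learns global relationships from the start, challenging the dominance of CNNs.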
Google Research (Dosovitskiy et al., 2020).
A foundation for the vision side of modern multimodal models (like GPT-4V).
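The patch-splitting step can be illustrated with a minimal NumPy sketch. The function name `image_to_patches`, the 224x224 input, and the 16x16 patch size are illustrative assumptions, not code from the paper; in the full model each flattened patch would then be linearly projected and given a position embedding before entering the Transformer.

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Hypothetical helper for illustration only.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = h // patch_size, w // patch_size
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One row per patch: (num_patches, patch_size * patch_size * C)
    return patches.reshape(rows * cols, patch_size * patch_size * c)

# Assumed example input: a 224x224 RGB image cut into 16x16 patches.
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = image_to_patches(image, patch_size=16)
print(tokens.shape)  # (196, 768)
```

Each of the 196 rows is one token: a 14x14 grid of patches, each holding 16 * 16 * 3 = 768 raw pixel values.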