The Vision Transformer (ViT) applies the Transformer architecture directly to images with minimal modifications. By splitting an image into fixed-size patches and treating them as 'tokens' (similar to words in NLP), ViT learns global relationships across the entire image, challenging the long-standing dominance of Convolutional Neural Networks (CNNs).
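The patch-splitting step can be sketched in NumPy; this is a minimal illustration (the full ViT additionally projects each flattened patch through a learned linear layer and adds position embeddings before feeding the sequence to the Transformer):

```python
import numpy as np

def image_to_patches(image, patch_size=16):
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Each patch becomes one 'token' of dimension patch_size * patch_size * C,
    analogous to a word embedding in NLP.
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    # Carve the image into a (rows, cols) grid of patch_size x patch_size blocks.
    grid = image.reshape(h // patch_size, patch_size,
                         w // patch_size, patch_size, c)
    # Bring the two grid axes together, then flatten each patch to a vector.
    grid = grid.transpose(0, 2, 1, 3, 4)
    return grid.reshape(-1, patch_size * patch_size * c)

# A 224x224 RGB image yields a sequence of 14*14 = 196 tokens of dimension 768.
tokens = image_to_patches(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 768)
```

With the default 16x16 patches (the "16x16 words" of the paper title), a standard 224x224 input becomes a sequence of 196 tokens, short enough for full self-attention across the whole image.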
ViT was introduced by Google Research in the paper 'An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale' (Dosovitskiy et al., 2020).
ViT-style image encoders are the foundation for modern multimodal models such as GPT-4V and Gemini, which process both text and images.