The Vision Transformer (ViT) applies the Transformer architecture directly to images with minimal modifications. By splitting an image into fixed-size patches and treating them as 'tokens' (similar to words in NLP), ViT learns global relationships across the entire image, challenging the long-standing dominance of Convolutional Neural Networks (CNNs).
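The patch-splitting step can be sketched in NumPy; this is a minimal illustration (the full ViT additionally projects each flattened patch through a learned linear layer and adds position embeddings before feeding the sequence to the Transformer):

```python
import numpy as np

def image_to_patches(image, patch_size=16):
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Each patch becomes one 'token' of dimension patch_size * patch_size * C,
    analogous to a word embedding in NLP.
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    # Carve the image into a (rows, cols) grid of patch_size x patch_size blocks.
    grid = image.reshape(h // patch_size, patch_size,
                         w // patch_size, patch_size, c)
    # Bring the two grid axes together, then flatten each patch to a vector.
    grid = grid.transpose(0, 2, 1, 3, 4)
    return grid.reshape(-1, patch_size * patch_size * c)

# A 224x224 RGB image yields a sequence of 14*14 = 196 tokens of dimension 768.
tokens = image_to_patches(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 768)
```

With the default 16x16 patches (the "16x16 words" of the paper title), a standard 224x224 input becomes a sequence of 196 tokens, short enough for full self-attention across the whole image.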
ViT was introduced by Google Research in the paper 'An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale' (Dosovitskiy et al., 2020).
ViT-style image encoders are the foundation for modern multimodal models such as GPT-4V and Gemini, which process both text and images.