Vision Transformer (ViT)

What is a Vision Transformer (ViT)?

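A Vision Transformer (ViT) is an architecture that applies the Transformer directly to images: it splits each image into fixed-size patches, treats every patch as a token, and processes the resulting sequence with self-attention. Because attention relates every patch to every other patch from the first layer onward, ViT learns global relationships from the start, which challenged the dominance of convolutional neural networks (CNNs) in computer vision.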
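The patch-splitting step can be illustrated with a minimal NumPy sketch. The function name `patchify`, the 224x224 RGB input, and the 16-pixel patch size are assumptions chosen for illustration; the sketch covers only the tokenization step, not the learned linear projection, position embeddings, or attention layers of a full model.

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    with patches ordered row by row (left to right, top to bottom).
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = h // patch_size, w // patch_size
    # (H, W, C) -> (rows, P, cols, P, C): separate the patch grid from pixel offsets
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    # -> (rows, cols, P, P, C): bring the two grid axes together
    patches = patches.transpose(0, 2, 1, 3, 4)
    # -> (rows * cols, P * P * C): one flat vector (token) per patch
    return patches.reshape(rows * cols, patch_size * patch_size * c)

# Hypothetical input: a 224x224 RGB image filled with dummy values.
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = patchify(image, patch_size=16)
print(tokens.shape)  # (196, 768)
```

Each of the 196 rows is one token. In a full model, every token would typically pass through a learned linear layer and receive a position embedding before entering the Transformer encoder.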

Where did the term "Vision Transformer (ViT)" come from?

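The term was introduced by Dosovitskiy et al. at Google Research in the 2020 paper "An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale."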

How is "Vision Transformer (ViT)" used today?

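ViT and its variants are widely used as the image encoder in modern multimodal models, and are often described as the foundation for systems like GPT-4V, although the architecture of GPT-4V has not been publicly documented.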

Related Terms