A vision model trained by masking out large chunks of an image (e.g., 75%) and forcing the neural network to reconstruct the missing pixels.
BERT-like pre-training for images.
State-of-the-art for visual representation learning.