The process of slicing an image into a grid of small, non-overlapping square patches (e.g., 16x16 pixels) and flattening each patch into a vector, effectively 'tokenizing' the image for a Transformer.
ViT's equivalent of word tokenization.
Standard preprocessing for Vision Transformers.
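As a rough illustration of the slicing and flattening described above, here is a minimal NumPy sketch. The function name `patchify`, the height-width-channel array layout, and the 224x224 RGB input are assumptions made for this example, not part of any particular library. Real ViT implementations usually follow this step with a learned linear projection and position embeddings, which are omitted here.

```python
import numpy as np


def patchify(image, patch_size=16):
    """Split an (H, W, C) image into flattened, non-overlapping square patches.

    Hypothetical helper for illustration; returns an array of shape
    (num_patches, patch_size * patch_size * C), one row per patch.
    """
    height, width, channels = image.shape
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")

    grid_h, grid_w = height // patch_size, width // patch_size

    # Split each spatial axis into (grid index, offset within the patch).
    patches = image.reshape(grid_h, patch_size, grid_w, patch_size, channels)
    # Bring the two grid axes to the front so each patch is contiguous.
    patches = patches.transpose(0, 2, 1, 3, 4)
    # Flatten every patch into a single vector (one "token" per patch).
    return patches.reshape(grid_h * grid_w, patch_size * patch_size * channels)


# Assumed example input: a 224x224 RGB image filled with zeros.
image = np.zeros((224, 224, 3), dtype=np.float32)
tokens = patchify(image)
print(tokens.shape)  # (196, 768): a 14x14 grid of patches, each 16*16*3 values
```

Patches come out in row-major order (left to right, then top to bottom), so the resulting sequence plays the same role as a sequence of word tokens.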