The number of training examples processed together in one iteration. The model updates its weights once per batch. Larger batches provide a more stable gradient estimate but require more VRAM, while smaller batches add noise that can help escape local minima.
A fundamental hyperparameter in Stochastic Gradient Descent (SGD) and its variants; tuning it balances training speed against stability, and its upper bound is typically set by available GPU memory.
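A minimal PyTorch sketch of the idea: the DataLoader's `batch_size` argument controls how many examples feed each gradient estimate, and the optimizer steps once per batch. The dataset, sizes, and learning rate here are illustrative, not prescriptive.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy regression data: 1024 examples, 10 features (values are arbitrary).
X = torch.randn(1024, 10)
y = torch.randn(1024, 1)

# batch_size=64 means each gradient is estimated from 64 examples.
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for xb, yb in loader:          # one iteration per batch
    optimizer.zero_grad()
    loss = loss_fn(model(xb), yb)
    loss.backward()            # gradient averaged over the 64-example batch
    optimizer.step()           # exactly one weight update per batch
```

Raising `batch_size` here would smooth the gradient at the cost of more VRAM per step; lowering it would inject the noisier updates described above.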