An optimization algorithm that minimizes the loss function by iteratively moving the parameters in the direction of steepest descent, i.e. along the negative gradient. Stochastic gradient descent (SGD) estimates that gradient from small random batches of data rather than the full dataset, which makes each update cheap.
The engine that powers neural network training.
Variants such as Adam are used in virtually all modern LLM and vision-model training.
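A minimal sketch of the idea, assuming a simple mean-squared-error linear-regression loss and illustrative names (`sgd_step`, `train`) that are not from the original text: each step computes the gradient on a small random batch and nudges the weights a small step against it.

```python
import numpy as np

def sgd_step(w, X_batch, y_batch, lr=0.01):
    """One SGD update: w <- w - lr * gradient of the batch MSE loss."""
    preds = X_batch @ w
    grad = 2.0 * X_batch.T @ (preds - y_batch) / len(y_batch)  # d(MSE)/dw
    return w - lr * grad

def train(X, y, epochs=10, batch_size=32, lr=0.01, seed=0):
    """Shuffle the data each epoch and apply SGD updates batch by batch."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            w = sgd_step(w, X[batch], y[batch], lr)
    return w

# Toy usage: recover the true weights [2.0, -3.0] from noisy linear data.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 2))
y = X @ np.array([2.0, -3.0]) + 0.01 * rng.normal(size=1000)
print(train(X, y))  # approximately [ 2.0, -3.0]
```

Optimizers like Adam keep the same batch-gradient core but add per-parameter learning-rate scaling and momentum-style running averages on top of this update rule.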