Adam (Adaptive Moment Estimation) is one of the most widely used optimization algorithms for training deep learning models. It combines the benefits of two other extensions of stochastic gradient descent: Momentum (which smooths the optimization path using an exponential moving average of gradients) and RMSProp (which scales the learning rate per parameter based on the magnitude of recent gradients). This allows it to handle sparse gradients and non-stationary objectives efficiently, often requiring little hyperparameter tuning.
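The combination of the two moving averages can be sketched in a few lines of NumPy. This is a minimal illustration of the update rule, not a production implementation; the hyperparameter defaults (lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8) follow the values proposed in the original paper, and the toy objective f(theta) = theta² is chosen only for demonstration:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update combining Momentum- and RMSProp-style statistics."""
    m = beta1 * m + (1 - beta1) * grad       # moving average of gradients (Momentum)
    v = beta2 * v + (1 - beta2) * grad**2    # moving average of squared gradients (RMSProp)
    m_hat = m / (1 - beta1**t)               # bias correction: averages start at zero,
    v_hat = v / (1 - beta2**t)               # so early estimates are scaled up
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy example: minimize f(theta) = theta^2, whose gradient is 2*theta.
theta = np.array([5.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 1001):          # Adam's bias correction assumes t starts at 1
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.1)
print(theta)  # converges toward the minimum at 0
```

Note the per-parameter scaling by `sqrt(v_hat)`: parameters with consistently large gradients take smaller effective steps, which is what lets Adam cope with poorly scaled or sparse gradients without manual learning-rate schedules.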
Adam was introduced by Diederik Kingma and Jimmy Ba in their 2014 paper "Adam: A Method for Stochastic Optimization".
It is a common default choice in PyTorch and TensorFlow for most tasks, from vision to NLP.