Adam (name derived from Adaptive Moment Estimation) is a machine learning algorithm for first-order gradient-based optimization that computes adaptive learning rates for each parameter.
Adagrad works well for sparse gradients by scaling the learning rate for each parameter according to the history of gradients from previous iterations.
In RMSprop, the learning rate is adapted using a moving average of the squared gradient for each weight, making it useful for mini-batch learning in online settings.
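The per-parameter rescaling that RMSprop performs can be sketched in a few lines of NumPy. This is a minimal illustration, not a full implementation; the learning rate, decay rate, and epsilon values here are arbitrary illustrative choices:

```python
import numpy as np

def rmsprop_step(params, grads, avg_sq, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSprop update: adapt each parameter's step size using a
    moving average of its squared gradient."""
    avg_sq = decay * avg_sq + (1 - decay) * grads ** 2
    params = params - lr * grads / (np.sqrt(avg_sq) + eps)
    return params, avg_sq

# Two parameters whose gradients differ by a factor of 100: after
# rescaling by the root of the squared-gradient average, their
# effective steps come out nearly equal.
params = np.array([1.0, 1.0])
avg_sq = np.zeros(2)
grads = np.array([10.0, 0.1])
params, avg_sq = rmsprop_step(params, grads, avg_sq)
```

Note how the division by the root of the accumulated squared gradient makes the step size roughly independent of the raw gradient magnitude.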
There are a few small but important differences between Adam and the algorithms it is based on.
Adam and RMSprop with Momentum
Adam differs from RMSprop with momentum in two ways:
- RMSprop with momentum generates its parameter updates using momentum on the rescaled gradient, whereas Adam's updates are estimated directly from running averages of the first and second moments of the gradient.
- Adam includes a bias-correction term; RMSprop does not.
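The effect of the bias-correction term can be shown numerically. Because the moving averages start at zero, the raw estimates are biased toward zero during early steps; dividing by (1 - beta^t) undoes this. The snippet below uses beta2 = 0.999, the default from the Adam paper, and a constant squared gradient of 1.0:

```python
# Without bias correction, the moving average of the squared gradient
# starts far below its true value; the corrected estimate recovers it.
beta2 = 0.999
v = 0.0
for t in range(1, 11):
    v = beta2 * v + (1 - beta2) * 1.0   # raw moving average
v_hat = v / (1 - beta2 ** t)            # bias-corrected estimate

print(v)      # far below the true value of 1.0 after 10 steps
print(v_hat)  # ~1.0
```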
Adam and Adagrad
In Adagrad, the quantity that scales the learning rate for each parameter is the sum of all past and current squared gradients, so the effective learning rate can only shrink over time. In Adam, this quantity is a moving average of past squared gradients, which keeps its scale bounded.
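The contrast can be sketched directly by tracking both accumulators on the same stream of squared gradients. This is an illustrative snippet; the decay rate 0.999 follows the Adam paper's default, and the constant gradient stream is a made-up example:

```python
grads_sq = [1.0, 1.0, 1.0, 1.0, 1.0]  # squared gradients over 5 steps

# Adagrad: the scaling denominator is the unbounded sum of all past
# and current squared gradients.
adagrad_acc = 0.0
# Adam: the denominator is an exponential moving average, so it stays
# on the scale of recent squared gradients.
beta2 = 0.999
adam_acc = 0.0
for g2 in grads_sq:
    adagrad_acc += g2
    adam_acc = beta2 * adam_acc + (1 - beta2) * g2

print(adagrad_acc)  # 5.0 -- grows without bound, shrinking the step size
print(adam_acc)     # ~0.005 -- bounded by the recent gradient scale
```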
How Adam Works
The blog theberkeleyview summarized how Adam works in five steps:
- Compute the gradient and its element-wise square using the current parameters.
- Update the exponential moving average of the 1st-order moment and the 2nd-order moment.
- Compute bias-corrected (unbiased) estimates of the 1st-order and 2nd-order moments.
- Compute weight update: 1st-order moment unbiased average divided by the square root of 2nd-order moment unbiased average (and scale by learning rate).
- Apply update to the weights.
References
- Diederik P. Kingma and Jimmy Ba. "Adam: A Method for Stochastic Optimization."
- theberkeleyview. "ADAM: A Method for Stochastic Optimization." UC Berkeley Computer Vision Review Letters.
- "An overview of gradient descent optimization algorithms."
- "Adam -- latest trends in deep learning optimization."
Related Terms
- Adagrad: Adaptive gradient algorithm used for large-scale machine learning tasks in distributed environments.
- Adadelta: An extension of the Adagrad optimization algorithm that improves upon two main drawbacks of the method.
- RMSprop: Unpublished but widely known gradient descent optimization algorithm for mini-batch learning of neural networks.
- Stochastic gradient descent (SGD): Gradient-based optimization algorithm used in machine learning and deep learning for training artificial neural networks.
- Momentum: A machine learning strategy that helps accelerate stochastic gradient descent in the relevant direction while dampening oscillations.