Nesterov momentum, or Nesterov Accelerated Gradient (NAG), is a slightly modified version of Momentum with stronger theoretical convergence guarantees for convex functions. In practice, it has produced slightly better results than classical Momentum.
Difference Between Momentum and Nesterov Momentum
In the standard Momentum method, the gradient is computed using current parameters (θt). Nesterov momentum achieves stronger convergence by applying the velocity (vt) to the parameters in order to compute interim parameters (θ̃ = θt+μ*vt), where μ is the decay rate. These interim parameters are then used to compute the gradient, called a "lookahead" gradient step or a Nesterov Accelerated Gradient.
The reason this is sometimes referred to as a "lookahead" gradient is that computing the gradient based on interim parameters allow NAG to change velocity in a faster and more responsive way, resulting in more stable behavior than classical Momentum in many situations, particularly for higher values of μ. NAG is the correction factor for classical Momentum method.
CS231n Convolutional Neural Networks for Visual Recognition
Stanford Computer Science
Momentum Method and Nesterov Accelerated Gradient - Konvergen - Medium
On the importance of initialization and momentum in deep learning
Ilya Sutskever, James Martens, George Dahl, Geoffrey Hinton
Documentaries, videos and podcasts
- Momentum (support vector machine)A machine learning strategy that helps accelerate stochastic gradient descent in the relevant direction while dampening oscillations.
- Stochastic gradient descent (SGD)Gradient-based optimization algorithm used in machine learning and deep learning for training artificial neural networks.
- RMSpropUnpublished but widely-known gradient descent optimization algorithm for mini-batch learning of neural networks.
- AdagradAdaptive gradient algorithm used for large scale machine learning tasks in distributed environments.
- Adam (support vector machine)A gradient-based machine learning optimization algorithm that computes individual adaptive learning rates for each parameter, combining the advantages of Adagrad and RMSprop.
- Show More