Nesterov momentum, or Nesterov Accelerated Gradient (NAG), is a slightly modified version of Momentum with stronger theoretical convergence guarantees for convex functions. In practice, it has produced slightly better results than classical Momentum.

### Difference Between Momentum and Nesterov Momentum

In the standard Momentum method, the gradient is computed using the current parameters (*θt*). Nesterov momentum achieves stronger convergence by first applying the velocity (*vt*) to the parameters in order to compute interim parameters (*θ̃ = θt + μvt*), where *μ* is the decay rate. These interim parameters are then used to compute the gradient, called a "lookahead" gradient step or a Nesterov Accelerated Gradient.
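Written out as update rules, the two methods differ only in the point at which the gradient is evaluated. The following is a sketch rather than a definitive statement: it assumes an objective *f* and a learning rate *ε*, neither of which is named in the text above, and follows the common convention in which the velocity is added to the parameters.

$$
\begin{aligned}
\text{Momentum:}\quad & v_{t+1} = \mu v_t - \varepsilon \nabla f(\theta_t), & \theta_{t+1} &= \theta_t + v_{t+1} \\
\text{NAG:}\quad & v_{t+1} = \mu v_t - \varepsilon \nabla f(\theta_t + \mu v_t), & \theta_{t+1} &= \theta_t + v_{t+1}
\end{aligned}
$$

In the NAG line, the argument of the gradient is the interim parameter vector *θ̃ = θt + μvt*.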

This is sometimes referred to as a "lookahead" gradient because it is evaluated at the interim parameters, the point toward which the current velocity is carrying the parameters, rather than at the current parameters. Computing the gradient there allows NAG to change velocity in a faster and more responsive way, resulting in more stable behavior than classical Momentum in many situations, particularly for higher values of *μ*. In this sense, NAG can be viewed as classical Momentum with a correction factor applied.
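To make the lookahead step concrete, here is a minimal sketch in plain Python that runs both methods on a toy quadratic. Everything in it is an assumption chosen for illustration: the function names (`momentum_step`, `nesterov_step`, `run`), the objective, the learning rate `lr`, and the values of `mu` and `steps`.

```python
def grad(theta, curvatures):
    """Gradient of the toy objective f(theta) = 0.5 * sum(c_i * theta_i**2)."""
    return [c * x for c, x in zip(curvatures, theta)]


def loss(theta, curvatures):
    """Value of the toy objective at theta."""
    return 0.5 * sum(c * x * x for c, x in zip(curvatures, theta))


def momentum_step(theta, v, curvatures, lr, mu):
    # Classical Momentum: gradient evaluated at the current parameters.
    g = grad(theta, curvatures)
    v_new = [mu * vi - lr * gi for vi, gi in zip(v, g)]
    theta_new = [x + vi for x, vi in zip(theta, v_new)]
    return theta_new, v_new


def nesterov_step(theta, v, curvatures, lr, mu):
    # NAG: gradient evaluated at the interim ("lookahead") parameters.
    interim = [x + mu * vi for x, vi in zip(theta, v)]
    g = grad(interim, curvatures)
    v_new = [mu * vi - lr * gi for vi, gi in zip(v, g)]
    theta_new = [x + vi for x, vi in zip(theta, v_new)]
    return theta_new, v_new


def run(step_fn, steps=100, lr=0.02, mu=0.95):
    # Hypothetical ill-conditioned problem: one flat and one steep direction.
    curvatures = [1.0, 50.0]
    theta = [1.0, 1.0]
    v = [0.0, 0.0]
    for _ in range(steps):
        theta, v = step_fn(theta, v, curvatures, lr, mu)
    return loss(theta, curvatures)


if __name__ == "__main__":
    print("momentum loss:", run(momentum_step))
    print("nesterov loss:", run(nesterov_step))
```

With these particular settings, NAG should finish at a lower loss than classical Momentum, because the lookahead gradient damps the oscillation along the steep direction. This is a property of this toy setup and is not a general guarantee.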

## Further reading

- CS231n Convolutional Neural Networks for Visual Recognition. Stanford Computer Science.
- Momentum Method and Nesterov Accelerated Gradient. Roan Gylberth, Konvergen (Medium).
- On the importance of initialization and momentum in deep learning. Ilya Sutskever, James Martens, George Dahl, Geoffrey Hinton.