Momentum is an optimization technique in machine learning that accelerates stochastic gradient descent (SGD) along directions in which the gradient points consistently, while dampening oscillations in directions where it keeps changing sign. The resulting method is commonly called "SGD with momentum" and is widely used for training deep neural networks.
The idea of SGD with momentum can be understood through an analogy from physics in which a ball gains and loses momentum as it rolls over hilly terrain. If the loss of a learning algorithm is interpreted as the height of the terrain, the gradient of the loss function can be related to the force acting on the ball as it rolls up and down the hills. Specifically, the force is equal to the negative gradient of the loss function, where the loss function plays the role of the ball's potential energy at any point on a hill.
This means that: Force = −∇U,
where U = mgh is the potential energy of the ball (m is its mass, g the gravitational acceleration, and h its height). Placing the ball at some location with zero initial velocity is analogous to initializing the parameters with random numbers.
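The analogy can be written out as an update rule. The following is a sketch of the classical ("heavy-ball") form, not a formula taken from this article: the symbols θ (parameters), v (velocity), L (loss), η (learning rate) and μ (momentum coefficient, acting like friction) are assumed names introduced here, and the ball is assumed to have unit mass.

```latex
% Newton's second law for a unit-mass ball: acceleration equals force
\frac{dv}{dt} = F = -\nabla U(\theta)

% Discretized, with the loss L in place of U and a decay factor \mu on the old velocity
v_{t+1} = \mu \, v_t - \eta \, \nabla L(\theta_t), \qquad v_0 = 0
\theta_{t+1} = \theta_t + v_{t+1}
```

Setting μ = 0 discards the accumulated velocity and recovers plain gradient descent. Values of μ close to 1 let the velocity build up over many steps.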
Minimizing the loss function can then be seen as trying to get the ball to reach the deepest valley in the terrain, where the loss is smallest. When a hill is very steep, the momentum the ball has gained by the time it reaches the bottom can carry it up and over smaller hills. When the slope decreases, the momentum and velocity of the ball also decrease, and the ball eventually comes to rest in a valley. In other words, the momentum strategy treats the parameter vector as a ball rolling on the hilly terrain. The goal is to descend the slope faster than without momentum, while still controlling the velocity of the descent so that the ball does not overshoot the valley altogether.
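The rolling-ball picture can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the loss is a hypothetical elongated quadratic "valley", the exact gradient is used in place of a noisy mini-batch gradient, and the names `descend`, `lr` and `mu` as well as the chosen values are not taken from any particular library.

```python
def loss(theta):
    # Hypothetical terrain: a valley that is shallow along x and steep along y.
    x, y = theta
    return 0.5 * (x * x + 25.0 * y * y)


def grad(theta):
    # Exact gradient of the loss above (a real SGD run would use a noisy estimate).
    x, y = theta
    return [x, 25.0 * y]


def descend(theta0, lr=0.03, mu=0.9, steps=200):
    """Gradient descent with classical momentum; mu=0.0 gives plain gradient descent."""
    theta = list(theta0)
    velocity = [0.0] * len(theta)  # the ball starts at rest
    for _ in range(steps):
        g = grad(theta)
        # Keep a fraction mu of the old velocity, then add the push from the slope.
        velocity = [mu * v - lr * gi for v, gi in zip(velocity, g)]
        # Move the parameters (the ball) along the velocity.
        theta = [t + v for t, v in zip(theta, velocity)]
    return theta


start = [5.0, 1.0]
theta_plain = descend(start, mu=0.0)
theta_momentum = descend(start, mu=0.9)
print("plain:   ", loss(theta_plain))
print("momentum:", loss(theta_momentum))
```

With these particular settings the momentum run should end at a lower loss than the plain run after the same number of steps, because velocity accumulates along the shallow direction of the valley. If `lr` or `mu` is set too high, the iterates can oscillate or diverge instead.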
- Stochastic gradient descent (SGD): Gradient-based optimization algorithm used in machine learning and deep learning for training artificial neural networks.
- Adagrad: Adaptive gradient algorithm used for large scale machine learning tasks in distributed environments.
- Adadelta: An extension of the Adagrad machine learning optimization algorithm which improves upon the two main drawbacks of the method.
- RMSprop: Unpublished but widely-known gradient descent optimization algorithm for mini-batch learning of neural networks.
- Machine learning: A field of computer science enabling computers to learn.