RMSprop first appeared in the lecture slides of a Coursera online class on neural networks taught by Geoffrey Hinton of the University of Toronto. Hinton didn't publish RMSprop in a formal academic paper, but it still became one of the most popular gradient descent optimization algorithms for deep learning.
Hinton developed RMSprop to address a problem that arises when rprop is used with mini-batches. Rprop uses only the sign of each gradient and adapts a separate step size for each weight, which amounts to dividing the gradient by a different magnitude on every mini-batch. If successive mini-batches don't have similar gradients, the resulting weight increments and decrements no longer average out, and a weight can drift far from where the averaged gradient would leave it. This is in contrast to the intended behavior of stochastic gradient descent, which makes small adjustments to weights and biases so that the neural network performs better at a specific task with each iteration of the optimization algorithm. RMSprop also builds on the Adagrad adaptive gradient algorithm by addressing its aggressive, monotonically decreasing learning rates.
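As a rough illustration of the mini-batch problem, the sketch below compares a gradient-proportional update with a sign-only update for a single weight. The gradient sequence and the step size are hypothetical values chosen for the example, and the sign-only rule holds its step size fixed, which simplifies real rprop (rprop also adapts the step size per weight).

```python
# Hypothetical gradients for one weight over ten mini-batches:
# nine small positive gradients followed by one large negative gradient.
grads = [0.1] * 9 + [-0.9]

step = 0.01  # assumed step size, chosen only for illustration


def sign(x):
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


# Gradient-proportional updates (plain SGD): the ten updates cancel out,
# because the gradients sum to roughly zero.
sgd_shift = sum(-step * g for g in grads)

# Sign-only updates: every mini-batch moves the weight by the same amount,
# so nine moves in one direction outweigh one move in the other.
sign_shift = sum(-step * sign(g) for g in grads)

print("net shift, proportional updates:", sgd_shift)
print("net shift, sign-only updates:   ", sign_shift)
```

Under these assumptions the proportional updates leave the weight almost where it started, while the sign-only updates shift it by about -0.08, even though the gradients average to zero across the ten mini-batches.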
How RMSprop Works
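RMSprop mitigates the problem that occurs with rprop when the gradients of successive mini-batches vary too much by keeping a moving average of the squared gradient for each weight. The gradient of each mini-batch is divided by the square root of this average, called MeanSquare. In Hinton's lecture, the MeanSquare for a weight w at step t is calculated as:

$$\mathrm{MeanSquare}(w, t) = 0.9\,\mathrm{MeanSquare}(w, t-1) + 0.1\left(\frac{\partial E}{\partial w}(t)\right)^{2}$$

Here E is the error being minimized and ∂E/∂w(t) is the gradient with respect to w on mini-batch t.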
Because MeanSquare changes slowly, nearly the same divisor is applied to adjacent mini-batches. This process effectively averages the gradients over successive mini-batches so that weights can be finely calibrated.
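The update described above can be sketched in a few lines of plain Python. This is a minimal illustration and not a reference implementation: the function name `rmsprop_step`, the learning rate `lr`, and the small constant `eps` (added to avoid division by zero) are assumptions that do not appear in the text, and the decay of 0.9 follows the value used in Hinton's lecture.

```python
import math


def rmsprop_step(weights, grads, mean_square, lr=0.001, decay=0.9, eps=1e-8):
    """Apply one RMSprop update to a list of weights.

    mean_square holds the running average of the squared gradient for
    each weight. lr and eps are assumed hyperparameters for this sketch.
    """
    new_weights, new_mean_square = [], []
    for w, g, ms in zip(weights, grads, mean_square):
        # Moving average of the squared gradient for this weight.
        ms = decay * ms + (1.0 - decay) * g * g
        # Divide the gradient by the root of the mean square, then step.
        w = w - lr * g / (math.sqrt(ms) + eps)
        new_weights.append(w)
        new_mean_square.append(ms)
    return new_weights, new_mean_square


# Usage: minimize the toy function f(w) = w0^2 + 10 * w1^2,
# whose gradient is (2 * w0, 20 * w1).
weights = [1.0, 1.0]
mean_square = [0.0, 0.0]
for _ in range(500):
    grads = [2.0 * weights[0], 20.0 * weights[1]]
    weights, mean_square = rmsprop_step(weights, grads, mean_square, lr=0.01)

print(weights)
```

Although the two gradients differ in scale by a factor of ten, dividing by the root mean square gives both weights steps of similar size, so both approach zero at a similar pace in this toy setup.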
Geoffrey Hinton, Nitish Srivastava, Kevin Swersky. Neural Networks for Machine Learning, Lecture 6.
Understanding RMSprop -- faster neural network learning