Machine learning algorithms use a hyperparameter, the learning rate (η), which is typically chosen manually before training. The learning rate determines how far the algorithm shifts the weights and biases in the direction opposite to the gradient computed at each iteration, where the end goal is to reduce the loss function to an acceptable level. As an adaptive gradient algorithm, Adagrad computes individual learning rates for different parameters. In that context, Adagrad is recognized as a significant improvement in the robustness of stochastic gradient descent.
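For reference, the role of the learning rate in the standard gradient descent update can be written as follows, where θ denotes the parameters and L the loss function:

```latex
% One iteration of gradient descent: the learning rate \eta scales how far the
% parameters move against the gradient of the loss.
\theta_{t+1} = \theta_t - \eta \, \nabla_{\theta} L(\theta_t)
```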
Finding an optimal learning rate for a machine learning algorithm can be a complex problem, particularly when there are many thousands or millions of dimensions to vary, as is the case for deep neural networks. If the learning rate is set too high, the parameters may move too erratically to ever reach an acceptable level of loss. On the other hand, setting the learning rate too low results in very slow progress.
Adagrad was designed for use in these large-scale problems, where it is impractical to manually choose different learning rates for each dimension of the problem because of the sheer number of dimensions. It adaptively scales the learning rate for each dimension to ensure that the training process is neither too slow nor too volatile and imprecise. To do this, Adagrad dynamically incorporates knowledge of the geometry of the data observed in past iterations. It then applies that knowledge to set lower learning rates for more frequently occurring features and higher learning rates for relatively infrequent features. As a result, the infrequent features stand out, enabling the learner to identify rare yet highly predictive features.
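A minimal sketch of a single Adagrad step illustrates this per-dimension scaling; the function and variable names below are illustrative rather than taken from any particular library:

```python
import numpy as np

def adagrad_step(theta, grad, accum, lr=0.01, eps=1e-8):
    """Apply one Adagrad update.

    theta -- parameter vector
    grad  -- gradient of the loss with respect to theta at this iteration
    accum -- running sum of squared gradients, one entry per dimension
    """
    accum = accum + grad ** 2  # accumulate the squared gradients seen so far, per dimension
    # Dimensions with large accumulated gradients (frequent features) get smaller steps;
    # dimensions with small accumulated gradients (rare features) keep larger steps.
    theta = theta - lr * grad / (np.sqrt(accum) + eps)
    return theta, accum
```

In effect, each dimension i receives its own learning rate, η divided by the square root of the accumulated squared gradients for that dimension (plus a small constant for numerical stability).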
In 2012, Google X lab successfully used Adagrad to train a neural network with 1 billion connections, running on 16,000 computer processors, to browse random YouTube videos and recognize high-level features. Over the course of a three-day trial, the system achieved 81.7% accuracy in detecting human faces, 76.7% accuracy when identifying human body parts, and 74.8% accuracy when identifying cats. This was unsupervised learning, meaning that the algorithm wasn't fed any information to help it identify distinguishing features and wasn't viewing pre-labeled images or videos.
One of the disadvantages of Adagrad is that its optimization method can result in aggressive, monotonically decreasing learning rates. Several related machine learning algorithms were developed to address this issue; a short sketch contrasting the accumulation schemes follows the list below.
- Adadelta is an extension of Adagrad that attempts to address its radically diminishing learning rates.
- RMSprop is an unpublished but widely known optimization algorithm that shares similarities with Adagrad; it also addresses the radically diminishing learning rates, making it more suitable for mini-batch learning.
- Adam is essentially a combination of RMSprop and stochastic gradient descent with momentum.
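The sketch below contrasts Adagrad's ever-growing accumulator with the exponentially decaying average used by RMSprop-style methods; the decay value and function names are illustrative assumptions, not taken from the papers listed below:

```python
import numpy as np

def adagrad_effective_rate(grads, lr=0.1, eps=1e-8):
    """Effective per-dimension step size under Adagrad: the accumulator only grows."""
    accum = np.zeros_like(grads[0])
    for g in grads:
        accum += g ** 2
    return lr / (np.sqrt(accum) + eps)

def rmsprop_effective_rate(grads, lr=0.1, decay=0.9, eps=1e-8):
    """Effective step size with an RMSprop-style exponentially decaying average."""
    avg = np.zeros_like(grads[0])
    for g in grads:
        avg = decay * avg + (1 - decay) * g ** 2
    return lr / (np.sqrt(avg) + eps)

# With a constant gradient of 1.0, Adagrad's effective rate keeps shrinking with
# every additional step, while the RMSprop-style rate settles near lr / |g|.
grads = [np.array([1.0])] * 100
print(adagrad_effective_rate(grads))   # about 0.01 after 100 steps, still decreasing
print(rmsprop_effective_rate(grads))   # about 0.1, stable
```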
- ADADELTA: An Adaptive Learning Rate Method (Matthew D. Zeiler)
- Adaptive Subgradient Methods for Online Learning and Stochastic Optimization (John Duchi, Elad Hazan, Yoram Singer)
- Google's Artificial Brain Learns to Find Cat Videos
- Stochastic gradient descent (SGD): gradient-based optimization algorithm used in machine learning and deep learning for training artificial neural networks.
- RMSprop: unpublished but widely known gradient descent optimization algorithm for mini-batch learning of neural networks.
- Adadelta: an extension of the Adagrad machine learning optimization algorithm which improves upon the two main drawbacks of the method.
- Adam (optimization algorithm): a gradient-based machine learning optimization algorithm that computes individual adaptive learning rates for each parameter, combining the advantages of Adagrad and RMSprop.
- Support vector machine: set of methods for supervised statistical learning.