Adagrad

Adagrad is an adaptive gradient algorithm that has demonstrated good results on large-scale machine learning tasks (e.g. training deep neural networks) in distributed environments.

Gradient-based machine learning algorithms use a hyper-parameter, the learning rate (η), which is typically chosen manually before training and is sometimes adjusted on a schedule as training proceeds. The learning rate controls how far the algorithm shifts weights and biases at each iteration in the direction opposite to the gradient of the loss, where the end goal is to minimize the loss function to an acceptable level. As an adaptive gradient algorithm, Adagrad instead computes individual learning rates for different parameters. In that context, Adagrad is recognized as a significant improvement in the robustness of stochastic gradient descent.

Finding an optimal learning rate for a machine learning algorithm can be a complex problem, particularly when there are many thousands or millions of dimensions to vary, as is the case for deep neural networks. If the learning rate is set too high, the parameters may oscillate or diverge and never reach an acceptable level of loss. On the other hand, setting the learning rate too low will result in very slow progress.
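The effect of the learning rate can be illustrated with a minimal sketch. The function f(x) = x**2, the starting point, the step count, and the three learning-rate values are all assumptions chosen for illustration; they do not come from any particular training setup.

```python
def gradient_descent(lr, steps=50, x0=1.0):
    """Minimize f(x) = x**2 with a fixed learning rate and return the final x."""
    x = x0
    for _ in range(steps):
        grad = 2.0 * x       # derivative of x**2
        x = x - lr * grad    # step in the direction opposite to the gradient
    return x

too_low = gradient_descent(lr=0.001)   # barely moves away from the starting point
moderate = gradient_descent(lr=0.1)    # ends close to the minimum at 0
too_high = gradient_descent(lr=1.1)    # overshoots further on every step

print(too_low, moderate, too_high)
```

With these illustrative values, the smallest rate leaves x near its starting point, the middle rate approaches the minimum, and the largest rate moves away from it on each step.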

Adagrad was designed for use in these large-scale problems, where it is impractical to manually choose a different learning rate for each dimension because of the sheer number of dimensions. It adaptively scales the learning rate for each dimension so that the training process is neither too slow nor too volatile and imprecise. To do this, Adagrad dynamically incorporates knowledge of the geometry of the data observed in past iterations. It then applies that knowledge to set lower learning rates for more frequently occurring features and higher learning rates for relatively infrequently occurring features. As a result, the infrequent features stand out, enabling the learner to identify rare, yet highly predictive features.
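The per-dimension scaling described above can be sketched as follows. This is a minimal version of the commonly used diagonal form of Adagrad, not a reference implementation. The function name adagrad_step, the base rate of 0.1, the small constant eps (added for numerical stability, as many implementations do), and the toy gradient pattern are assumptions for illustration.

```python
import math

def adagrad_step(params, grads, accum, lr=0.1, eps=1e-8):
    """Apply one Adagrad update to plain Python lists, modifying them in place."""
    for i, g in enumerate(grads):
        accum[i] += g * g                                   # running sum of squared gradients
        params[i] -= lr * g / (math.sqrt(accum[i]) + eps)   # step scaled per parameter

params = [0.0, 0.0]
accum = [0.0, 0.0]

# Hypothetical data: feature 0 is "frequent" (nonzero gradient on every step),
# feature 1 is "rare" (nonzero gradient only on every 10th step).
for t in range(100):
    grads = [1.0, 1.0 if t % 10 == 0 else 0.0]
    adagrad_step(params, grads, accum)

# Effective learning rate each parameter would receive on its next update.
effective_lr = [0.1 / (math.sqrt(a) + 1e-8) for a in accum]
print(accum, effective_lr)
```

In this toy run the rarely updated parameter keeps a larger effective learning rate than the frequently updated one, which is the behavior the paragraph above describes.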

Use of Adagrad at Google X Lab

In 2012, Google X lab successfully used Adagrad to train a neural network with 1 billion connections, running on 16,000 computer processors, to browse random YouTube videos and recognize high-level features. Over the course of a three-day trial, the system achieved 81.7% accuracy in detecting human faces, 76.7% accuracy when identifying human body parts, and 74.8% accuracy when identifying cats. This was unsupervised learning, meaning that the algorithm wasn't fed any information to help it identify distinguishing features and wasn't viewing pre-labeled images or videos.

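One of the disadvantages of Adagrad is that its learning rates decrease aggressively and monotonically: because it accumulates squared gradients over the entire training run, the effective learning rate of each parameter can only shrink and may eventually become too small for learning to continue. Some related machine learning algorithms were developed to address this issue.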
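The shrinking step size can be made concrete with a short sketch. It assumes the commonly used diagonal form of Adagrad, a constant gradient of 0.5, and a base rate of 0.1. These values and the helper name effective_rates are illustrative only.

```python
import math

def effective_rates(grads, lr=0.1, eps=1e-8):
    """Return Adagrad's effective learning rate after each gradient in the sequence."""
    accum = 0.0
    rates = []
    for g in grads:
        accum += g * g                              # the sum of squares never decreases
        rates.append(lr / (math.sqrt(accum) + eps))
    return rates

rates = effective_rates([0.5] * 1000)
print(rates[0], rates[99], rates[-1])
```

Because the accumulator only grows, each entry in rates is smaller than the one before it, even though the gradient itself never changes.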

Adadelta is an extension of Adagrad that attempts to solve its radically diminishing learning rates.
RMSprop is an unpublished but widely known optimization algorithm, proposed by Geoffrey Hinton in a lecture course, that shares similarities with Adagrad and addresses its radically diminishing learning rates to make it more useful for mini-batch learning.
Adam is essentially a combination of RMSprop and stochastic gradient descent with momentum.
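As a rough sketch of how one of these alternatives avoids the diminishing rate, the code below tracks two accumulators side by side: Adagrad's running sum of squared gradients and an exponentially decaying average of squared gradients in the style of RMSprop. The decay factor of 0.9, the base rate, the constant gradient, and the function name final_rates are assumptions for illustration. Adadelta and Adam involve further terms that are not shown here.

```python
import math

def final_rates(grads, lr=0.1, decay=0.9, eps=1e-8):
    """Return (Adagrad-style, RMSprop-style) effective learning rates after all gradients."""
    total = 0.0   # Adagrad: sum of all squared gradients seen so far
    avg = 0.0     # RMSprop-style: decaying average of squared gradients
    for g in grads:
        sq = g * g
        total += sq
        avg = decay * avg + (1.0 - decay) * sq
    adagrad_rate = lr / (math.sqrt(total) + eps)
    rmsprop_rate = lr / (math.sqrt(avg) + eps)
    return adagrad_rate, rmsprop_rate

adagrad_rate, rmsprop_rate = final_rates([0.5] * 1000)
print(adagrad_rate, rmsprop_rate)
```

With a constant gradient, the decaying average settles near the squared gradient, so the RMSprop-style rate levels off while the Adagrad-style rate keeps shrinking.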

Further Resources

Title: ADADELTA: An Adaptive Learning Rate Method
Author: Matthew D. Zeiler
Link: https://arxiv.org/pdf/1212.5701.pdf
Type: Journal

Title: Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
Author: John Duchi, Elad Hazan, Yoram Singer
Link: http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf
Type: Journal

Title: Google's Artificial Brain Learns to Find Cat Videos
Author: Liat Clark
Link: https://www.wired.com/2012/06/google-x-neural-network/
Type: Web
