Adagrad

Adaptive gradient algorithm used for large-scale machine learning tasks in distributed environments.

Adagrad is an adaptive gradient algorithm that has demonstrated good results on large-scale machine learning tasks (e.g. training deep neural networks) in distributed environments.

Gradient-based machine learning algorithms use a hyper-parameter, the learning rate (η), which is typically chosen manually and applied at every iteration of training. The learning rate controls how far the algorithm shifts weights and biases in the direction opposite to the gradient of the previous iteration, with the end goal of reducing the loss function to an acceptable level. As an adaptive gradient algorithm, Adagrad instead computes an individual learning rate for each parameter, and in that context it is recognized as a significant improvement to the robustness of stochastic gradient descent.
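
To make the role of the learning rate concrete, the following is a minimal sketch of a single plain gradient-descent step, written here purely for illustration; the function and parameter names are not taken from any particular library.

  import numpy as np

  # One plain gradient-descent step (illustrative sketch, not a library API).
  # theta: parameter vector, grad: gradient of the loss at theta, lr: the learning rate (η).
  def gradient_descent_step(theta: np.ndarray, grad: np.ndarray, lr: float = 0.01) -> np.ndarray:
      # Shift the parameters in the direction opposite to the gradient, scaled by the learning rate.
      return theta - lr * grad

Note that every dimension shares the same global learning rate lr, which is exactly the limitation Adagrad addresses.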

Finding an optimal learning rate for a machine learning algorithm can be a complex problem, particularly when there are many thousands or millions of dimensions to vary, as is the case for deep neural networks. If the learning rate is set too high, the parameters may oscillate too wildly to ever reach an acceptable level of loss. On the other hand, setting the learning rate too low results in very slow progress.

Adagrad was designed for these large-scale problems, where the sheer number of dimensions makes it impractical to manually choose a different learning rate for each one. It adaptively scales the learning rate for each dimension so that training is neither too slow nor too volatile and imprecise. To do this, Adagrad dynamically incorporates knowledge of the geometry of the data observed in past iterations, then applies that knowledge to set lower learning rates for more frequently occurring features and higher learning rates for relatively infrequent ones. As a result, the infrequent features stand out, enabling the learner to identify rare yet highly predictive features.
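
The per-dimension scaling can be sketched as follows. This is an illustrative implementation of the standard Adagrad update rule; the names, the learning rate of 0.01, and the small constant eps are assumptions chosen for the example, not code from any specific library.

  import numpy as np

  # Illustrative Adagrad step: a per-dimension running sum of squared gradients
  # rescales the global learning rate, so dimensions whose features occur frequently
  # (and therefore accumulate large gradients) take smaller steps, while rarely
  # occurring features keep a comparatively large effective learning rate.
  def adagrad_step(theta, grad, accum, lr=0.01, eps=1e-8):
      accum = accum + grad ** 2                            # accumulate squared gradients per dimension
      theta = theta - lr * grad / (np.sqrt(accum) + eps)   # eps guards against division by zero
      return theta, accum

  # Usage: the accumulator starts at zero with the same shape as the parameters.
  theta = np.zeros(3)
  accum = np.zeros(3)
  grad = np.array([0.5, 0.0, 0.1])
  theta, accum = adagrad_step(theta, grad, accum)

Because the accumulator only ever grows, the effective learning rate of each dimension shrinks over time, which motivates the related algorithms discussed below.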

Use of Adagrad at Google X Lab

In 2012, Google X lab successfully used Adagrad to train a neural network of 16,000 computer processors and 1 billion connections to browse random YouTube videos and recognize high-level features. Over the course of a three-day trial, the system achieved 81.7% accuracy in detecting human faces, 76.7% accuracy when identifying human body parts, and 74.8% accuracy when identifying cats. This was unsupervised learning, meaning that the algorithm wasn't fed any information to help it identify distinguishing features and wasn't viewing pre-labeled images or videos.

Related Machine Learning Algorithms

One of the disadvantages of Adagrad is that its optimization method can result in aggressive, monotonically decreasing learning rates. Some related machine learning algorithms were developed to address this issue; a rough sketch of their update rules follows the list below.

  • Adadelta is an extension of Adagrad that attempts to counteract its radically diminishing learning rates.
  • RMSprop is an unpublished but widely known optimization algorithm that shares similarities with Adagrad and likewise addresses its radically diminishing learning rates, making it more useful for mini-batch learning.
  • Adam is essentially a combination of RMSprop and stochastic gradient descent with momentum.
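
As a rough illustration of how these variants differ from Adagrad's ever-growing accumulator, the sketch below shows typical RMSprop and Adam update steps. The coefficient values and names are common defaults assumed for the example rather than a reference implementation.

  import numpy as np

  # Adagrad sums squared gradients forever, so its effective learning rates can only shrink.
  # RMSprop replaces the sum with an exponential moving average; Adam additionally keeps a
  # moving average of the gradient itself (momentum) and applies bias correction.

  def rmsprop_step(theta, grad, avg_sq, lr=0.001, rho=0.9, eps=1e-8):
      avg_sq = rho * avg_sq + (1 - rho) * grad ** 2        # decaying average instead of a running sum
      return theta - lr * grad / (np.sqrt(avg_sq) + eps), avg_sq

  def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
      m = b1 * m + (1 - b1) * grad                         # moving average of gradients (momentum)
      v = b2 * v + (1 - b2) * grad ** 2                    # moving average of squared gradients
      m_hat = m / (1 - b1 ** t)                            # bias correction; t counts steps from 1
      v_hat = v / (1 - b2 ** t)
      return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v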