Adaptive gradient algorithm used for large-scale machine learning tasks in distributed environments.
Adagrad is an adaptive gradient algorithm that has demonstrated good results on large-scale machine learning tasks (e.g., training deep neural networks) in distributed environments.
Machine learning algorithms use a hyperparameter called the learning rate (η), which is conventionally chosen by hand. The learning rate governs how far the algorithm shifts weights and biases in the direction opposite to the gradient computed on the previous iteration, with the end goal of minimizing the loss function to an acceptable level. As an adaptive gradient algorithm, Adagrad instead computes individual learning rates for different parameters. In that context, Adagrad is recognized as a significant improvement in the robustness of stochastic gradient descent.
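For reference, here is a minimal sketch of the plain stochastic gradient descent update that Adagrad refines, written in Python with NumPy; the function name and the default step size are illustrative assumptions, not part of any standard API:

```python
import numpy as np

def sgd_step(params, grad, lr=0.01):
    """One plain SGD step (illustrative sketch).

    Every parameter shares the same fixed learning rate lr and moves in the
    direction opposite to the gradient of the loss.
    """
    return params - lr * grad
```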
Finding an optimal learning rate for a machine learning algorithm can be a complex problem, particularly when there are many thousands or millions of dimensions to vary, as is the case for deep neural networks. If the learning rate is set too high, the parameters may move with too much volatility to ever reach an acceptable level of loss. On the other hand, setting the learning rate too low results in very slow progress.
Adagrad was designed for these large-scale problems, where it is impractical to manually choose a separate learning rate for each dimension because of the sheer number of dimensions. It adaptively scales the learning rate for each dimension so that training is neither too slow nor too volatile and imprecise. To do this, Adagrad dynamically incorporates knowledge of the geometry of the data observed in past iterations, applying lower learning rates to more frequently occurring features and higher learning rates to relatively infrequent ones, as sketched below. As a result, the infrequent features stand out, enabling the learner to identify rare yet highly predictive features.
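A minimal sketch of the standard Adagrad update in Python with NumPy: each dimension accumulates its own sum of squared gradients, and the base step size is divided by the square root of that sum. The function name, base step size, and epsilon smoothing term are illustrative assumptions rather than fixed choices:

```python
import numpy as np

def adagrad_step(params, grad, accum, lr=0.1, eps=1e-8):
    """One Adagrad step (illustrative sketch).

    accum holds the running sum of squared gradients for each dimension.
    Dimensions with a large accumulated gradient (frequent features) get a
    small effective step; dimensions with a small accumulated gradient
    (rare features) keep a step close to lr.
    """
    accum = accum + grad ** 2
    params = params - lr * grad / (np.sqrt(accum) + eps)
    return params, accum
```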
In 2012, Google X lab successfully used Adagrad to train a neural network of 16,000 computer processors and 1 billion connections to browse random YouTube videos and recognize high-level features. Over the course of a three-day trial, the system achieved 81.7% accuracy in detecting human faces, 76.7% accuracy when identifying human body parts, and 74.8% accuracy when identifying cats. This was unsupervised learning, meaning that the algorithm wasn't fed any information to help it identify distinguishing features and wasn't viewing pre-labeled images or videos.
One of the disadvantages of Adagrad is that its accumulated sum of squared gradients grows without bound, so the per-dimension learning rates decrease aggressively and monotonically until progress can effectively stall. Related algorithms such as RMSProp, AdaDelta, and Adam were developed to address this issue.
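For instance, RMSProp replaces Adagrad's ever-growing sum of squared gradients with an exponentially decaying average, so the effective learning rate can recover when recent gradients are small. A minimal sketch, with the decay factor and step size as assumed defaults:

```python
import numpy as np

def rmsprop_step(params, grad, avg_sq, lr=0.001, decay=0.9, eps=1e-8):
    """One RMSProp step (illustrative sketch).

    avg_sq is a decaying average of squared gradients, so unlike Adagrad's
    running sum it can shrink again, letting the effective step size recover.
    """
    avg_sq = decay * avg_sq + (1 - decay) * grad ** 2
    params = params - lr * grad / (np.sqrt(avg_sq) + eps)
    return params, avg_sq
```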