Adagrad improved upon earlier gradient-descent-based algorithms by adaptively scaling the learning rate (η) for each dimension of a system, making it possible to train deep neural networks with millions of dimensions using a process that is neither too volatile and imprecise, nor too slow.
The drawbacks of Adagrad are:
- Its learning rate decays continually throughout training, so after many iterations the learning rate becomes infinitesimally small (illustrated in the sketch after this list).
- Like earlier gradient descent algorithms, it still requires manual selection of the global learning rate.
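To make the first drawback concrete, here is a minimal NumPy sketch of the Adagrad update (variable names are illustrative, not from any particular library). Because the accumulated sum of squared gradients only grows, the effective per-dimension step size shrinks toward zero as training proceeds.

```python
import numpy as np

def adagrad_step(param, grad, accum, lr=0.01, eps=1e-8):
    """One Adagrad update: the learning rate is scaled per dimension
    by the square root of the running sum of squared gradients."""
    accum += grad ** 2                           # this sum only ever grows
    param -= lr * grad / (np.sqrt(accum) + eps)  # effective lr shrinks per dimension
    return param, accum

# With a constant gradient, the effective step size after t steps is
# roughly lr / sqrt(t), so it approaches zero as training continues.
param, accum = np.zeros(3), np.zeros(3)
for t in range(1000):
    param, accum = adagrad_step(param, np.ones(3), accum)
```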
To address Adagrad's drawbacks, Adadelta implements two new ideas.
- Accumulate the sum of squared gradients over a restricted time window rather than over all time. Adagrad, by contrast, can accumulate the sum of squared gradients all the way to infinity; since this sum is in the denominator and the learning rate is in the numerator, the learning rate approaches zero as the sum approaches infinity. By restricting the potential size of the sum, Adadelta ensures that learning continues to make progress even after a large number of iterations.
- Correct the mismatch in units that exists in most gradient-descent-based algorithms. In past algorithms (e.g. Adagrad, SGD, or Momentum), each parameter update has units that relate to the gradient rather than to the parameter being updated, so the units of the update don't match the units of the parameter. To address this, Adadelta uses a Hessian approximation, which is always positive, in place of the learning rate, ensuring that the update direction always follows the negative gradient at each step as in SGD. This eliminates the learning rate from the update rule completely, so there is no longer a requirement to set it manually. Both ideas are shown in the sketch after this list.
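Both ideas can be seen in a minimal NumPy sketch of the update rule, following Algorithm 1 of the original paper (Zeiler, 2012); variable names are illustrative, and rho and eps correspond to the paper's ρ and ε.

```python
import numpy as np

def adadelta_step(param, grad, avg_sq_grad, avg_sq_update, rho=0.95, eps=1e-6):
    """One Adadelta update (Zeiler, 2012)."""
    # Idea 1: a decaying average of squared gradients replaces
    # Adagrad's ever-growing sum, acting as a restricted window.
    avg_sq_grad = rho * avg_sq_grad + (1 - rho) * grad ** 2

    # Idea 2: the RMS of past updates stands in for the learning rate,
    # so the update carries the same units as the parameter and no
    # global learning rate has to be chosen.
    update = -(np.sqrt(avg_sq_update + eps) / np.sqrt(avg_sq_grad + eps)) * grad
    avg_sq_update = rho * avg_sq_update + (1 - rho) * update ** 2

    param += update
    return param, avg_sq_grad, avg_sq_update
```

Because the ratio of root-mean-square terms is always positive, each step still moves in the negative gradient direction, while the numerator RMS of past updates restores the units of the parameter.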
Related topics:
- Adagrad: Adaptive gradient algorithm used for large-scale machine learning tasks in distributed environments.
- Stochastic gradient descent (SGD): Gradient-based optimization algorithm used in machine learning and deep learning for training artificial neural networks.
- Adam (optimization algorithm): A gradient-based machine learning optimization algorithm that computes individual adaptive learning rates for each parameter, combining the advantages of Adagrad and RMSprop.
- RMSprop: Unpublished but widely known gradient descent optimization algorithm for mini-batch learning of neural networks.
- Momentum (machine learning): A strategy that helps accelerate stochastic gradient descent in the relevant direction while dampening oscillations.