Adadelta

An extension of the Adagrad machine learning optimization algorithm that addresses the two main drawbacks of that method.

Adadelta is a machine learning optimization algorithm that was created by Matthew D. Zeiler with the goal of addressing two drawbacks of the Adagrad method.

Background

Adagrad improved upon previous gradient descent based algorithms by adaptively scaling the learning rate (η) for each dimension of a system, making it possible to train deep neural networks with millions of dimensions using a process that is neither too volatile and imprecise, nor too slow.
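To make the adaptive scaling concrete, the following is a minimal NumPy sketch of a single per-dimension Adagrad step; the function name, variable names, and hyperparameter defaults are illustrative rather than taken from any particular library.

    import numpy as np

    def adagrad_step(x, grad, sq_grad_sum, lr=0.01, eps=1e-8):
        """One Adagrad update for a parameter vector x.

        sq_grad_sum is the per-dimension sum of squared gradients.
        It only ever grows, so the effective step size
        lr / sqrt(sq_grad_sum) shrinks as training goes on.
        """
        sq_grad_sum = sq_grad_sum + grad ** 2
        x = x - lr * grad / (np.sqrt(sq_grad_sum) + eps)
        return x, sq_grad_sum

The ever-growing sum of squared gradients in the denominator is what produces the continually decaying learning rate listed as the first drawback below.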



The drawbacks of Adagrad are:

  • It has a continually decaying learning rate throughout training, meaning that the learning rate will become infinitesimally small after many iterations.
  • Like earlier gradient descent algorithms, it still requires manual selection of a global learning rate.

Advances of Adadelta

To address Adagrad's drawbacks, Adadelta introduces two new ideas (a code sketch of the resulting update rule follows the list below).



  1. Accumulate the sum of squared gradients over a restricted time window rather than over all time. This differs from Adagrad, which can accumulate the sum of squared gradients all the way to infinity. Since this sum is in the denominator and the learning rate in the numerator, the learning rate approaches zero as the sum grows. By restricting the potential size of the sum, Adadelta ensures that learning continues to make progress even after a large number of iterations.
  2. Correct the mismatch in units that exists in most gradient descent based algorithms. With past algorithms (e.g. Adagrad, SGD, or Momentum), the units of each parameter update relate to the gradient rather than to the parameter, so the units of the update don't match the units of the parameter being updated. To address this, Adadelta uses a Hessian approximation, which is always positive, in place of the learning rate, ensuring that the update direction always follows the negative gradient at each step as in SGD. This eliminates the learning rate from the update rule completely, so it no longer needs to be set manually.
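The two ideas combine into the update rule given in Zeiler's paper: keep exponentially decaying averages of squared gradients and squared updates, and scale each step by the ratio of their root mean squares. Below is a minimal NumPy sketch of that rule; the function and variable names are illustrative, and the default values of rho and eps are only one common choice, not prescribed by the source.

    import numpy as np

    def adadelta_step(x, grad, avg_sq_grad, avg_sq_update, rho=0.95, eps=1e-6):
        """One Adadelta update for a parameter vector x.

        avg_sq_grad and avg_sq_update are exponentially decaying
        averages, acting as a restricted window over past squared
        gradients and past squared updates (idea 1).
        """
        avg_sq_grad = rho * avg_sq_grad + (1 - rho) * grad ** 2

        # RMS of past updates over RMS of gradients replaces the learning
        # rate (idea 2): the ratio is always positive, so the step still
        # follows the negative gradient, and its units match the parameter.
        update = -(np.sqrt(avg_sq_update + eps) / np.sqrt(avg_sq_grad + eps)) * grad

        avg_sq_update = rho * avg_sq_update + (1 - rho) * update ** 2
        return x + update, avg_sq_grad, avg_sq_update

Note that the decay constant rho and the small stability constant eps are the only hyperparameters; no global learning rate appears anywhere in the step.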



People

  • Matthew D. Zeiler (Creator)

Further reading

  • ADADELTA: An Adaptive Learning Rate Method, by Matthew D. Zeiler (PDF)
  • An overview of gradient descent optimization algorithms, by Sebastian Ruder (Web)
