

An extension of the Adagrad machine learning optimization algorithm that addresses the two main drawbacks of that method.

Adadelta is a machine learning optimization algorithm that was created by Matthew D. Zeiler with the goal of addressing two drawbacks of the Adagrad method.


Adagrad improved on earlier gradient-descent-based algorithms by adaptively scaling the learning rate (η) for each dimension of the parameter space, making it possible to train deep neural networks with millions of dimensions using a process that is neither too volatile and imprecise nor too slow.

The drawbacks of Adagrad are:

  • Its learning rate decays continually throughout training, so the learning rate becomes infinitesimally small after many iterations.
  • It still requires a manually selected global learning rate, as earlier gradient descent algorithms do.
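
The first drawback can be illustrated with a minimal sketch. It assumes the common form of Adagrad in which each dimension's step size is η divided by the square root of that dimension's accumulated squared gradients. The function name `adagrad_effective_rates`, the value η = 0.1, the small constant `eps`, and the toy gradient values are illustrative assumptions, not taken from the article.

```python
import math


def adagrad_effective_rates(grad_history, eta=0.1, eps=1e-8):
    """Return the per-dimension effective step size after each iteration.

    Hypothetical helper: eta still has to be chosen by hand, and the
    accumulated sum of squared gradients can only grow.
    """
    dims = len(grad_history[0])
    sum_sq = [0.0] * dims  # running sum of squared gradients, per dimension
    history = []
    for g in grad_history:
        for i in range(dims):
            sum_sq[i] += g[i] ** 2
        # effective rate = eta / sqrt(accumulated squared gradients)
        history.append([eta / (math.sqrt(s) + eps) for s in sum_sq])
    return history


# Toy input: a constant gradient, large in dimension 0 and small in dimension 1.
grads = [[1.0, 0.01]] * 10000
rates = adagrad_effective_rates(grads)
first, last = rates[0], rates[-1]
print("effective rates after 1 step:      ", first)
print("effective rates after 10,000 steps:", last)
```

With a constant gradient, the accumulated sum grows linearly in the number of steps, so after 10,000 steps each dimension's effective rate is about 100 times smaller than after the first step. The dimension with the smaller gradient keeps a larger effective rate, which is the per-dimension scaling described above.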

Advances of Adadelta

To address Adagrad's drawbacks, Adadelta implements two new ideas.

  1. Accumulate the sum of squared gradients over a restricted time window rather than over all time. This differs from Adagrad, whose sum of squared gradients keeps growing for as long as training continues. Since this sum is in the denominator and the learning rate is in the numerator, Adagrad's effective learning rate approaches zero as the sum approaches infinity. By restricting the potential size of the sum, Adadelta ensures that learning continues to make progress even after a large number of iterations. In Zeiler's paper the window is implemented as an exponentially decaying average of squared gradients rather than by storing past gradients.
  2. Correct the mismatch in units that exists in most gradient-descent-based algorithms. In earlier algorithms (e.g. SGD, Momentum, or Adagrad), the units of each update relate to the gradient rather than to the parameter, so the units of the update don't match the units of the parameter being updated. To address this drawback, Adadelta replaces the learning rate with an always-positive approximation derived from the diagonal of the Hessian, which ensures that the update direction follows the negative gradient at each step, as in SGD. This eliminates the learning rate from the update rule completely, so there is no longer a requirement to set it manually.
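
The two ideas above can be sketched together in a few lines. The code follows the update rule as given in Zeiler's paper, with a decaying average of squared gradients and a decaying average of squared updates. The function name `adadelta_minimize`, the defaults `rho=0.95` and `eps=1e-6`, and the toy quadratic objective are assumptions chosen for illustration; this is a sketch, not a reference implementation.

```python
import math


def adadelta_minimize(grad_fn, x0, rho=0.95, eps=1e-6, steps=2000):
    """Minimal per-dimension Adadelta loop (illustrative sketch).

    grad_fn takes a list of parameters and returns a list of gradients.
    No learning rate appears anywhere in the update.
    """
    x = list(x0)
    avg_sq_grad = [0.0] * len(x)   # decaying average of squared gradients
    avg_sq_delta = [0.0] * len(x)  # decaying average of squared updates
    for _ in range(steps):
        g = grad_fn(x)
        for i in range(len(x)):
            # Idea 1: a decaying average stands in for a restricted window.
            avg_sq_grad[i] = rho * avg_sq_grad[i] + (1 - rho) * g[i] ** 2
            # Idea 2: RMS of past updates over RMS of gradients replaces the
            # learning rate, so the step carries the parameter's units.
            rms_delta = math.sqrt(avg_sq_delta[i] + eps)
            rms_grad = math.sqrt(avg_sq_grad[i] + eps)
            delta = -(rms_delta / rms_grad) * g[i]
            avg_sq_delta[i] = rho * avg_sq_delta[i] + (1 - rho) * delta ** 2
            x[i] += delta
    return x


def objective(x):
    # Toy objective (an assumption): f(x) = x0^2 + 10 * x1^2
    return x[0] ** 2 + 10.0 * x[1] ** 2


def objective_grad(x):
    return [2.0 * x[0], 20.0 * x[1]]


start = [1.0, -1.0]
final = adadelta_minimize(objective_grad, start)
print("f(start) =", objective(start))
print("f(final) =", objective(final))
```

Because the ratio of the two root-mean-square terms is always positive, every step moves against the sign of the current gradient. The constant `eps` matters more than its size suggests: it sets the scale of the very first updates, since both running averages start at zero.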


Further Resources


ADADELTA: An Adaptive Learning Rate Method

Matthew D. Zeiler


An overview of gradient descent optimization algorithms

Sebastian Ruder


