An extension of the Adagrad machine learning optimization algorithm which improves upon the two main drawbacks of the method.

Edits on 7 Mar, 2019


Article

Adadelta is a machine learning optimization algorithm that was created by Matthew D. Zeiler with the goal of addressing two drawbacks of the Adagrad method.

Background

Adagrad improved upon previous gradient descent based algorithms by adaptively scaling the learning rate (η) for each dimension in a system, making it possible to train deep neural networks with millions of dimensions through a process that is neither too volatile and imprecise nor too slow.
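For reference, a common way to write Adagrad's per-dimension update is sketched below. The symbols are assumptions introduced here for illustration: g is the gradient, G the running sum of squared gradients, θ the parameter, i the dimension index, t the iteration, and ε a small constant that avoids division by zero. Some presentations place ε inside the square root instead.

$$
G_{t,i} = \sum_{\tau=1}^{t} g_{\tau,i}^{2}, \qquad
\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,i}} + \epsilon}\, g_{t,i}
$$

Each dimension i divides the same global η by its own accumulated gradient magnitude, which is what makes the scaling adaptive.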

The drawbacks of Adagrad are:

- It has a continually decaying learning rate throughout training, meaning that the learning rate will become infinitesimally small after many iterations.
- It still requires manual selection of a global learning rate, an issue it shares with previous gradient descent algorithms.
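The first drawback can be illustrated with a minimal sketch. It assumes the common Adagrad formulation in which the effective step size is the global learning rate divided by the square root of the accumulated squared gradients. The function name `adagrad_effective_rates` and the values `lr=0.1` and `eps=1e-8` are hypothetical choices for this example, not taken from the article.

```python
import math


def adagrad_effective_rates(gradients, lr=0.1, eps=1e-8):
    """Return the effective step size after each gradient in a 1-D problem.

    Assumed formulation: lr / (sqrt(sum of squared gradients) + eps).
    """
    accumulated = 0.0
    rates = []
    for g in gradients:
        accumulated += g * g  # the sum never shrinks
        rates.append(lr / (math.sqrt(accumulated) + eps))
    return rates


# A constant gradient of 1.0 for 1000 iterations (toy input).
rates = adagrad_effective_rates([1.0] * 1000)
print(rates[0], rates[99], rates[999])
```

Even though the gradient never changes in this toy input, the effective step size keeps falling, roughly in proportion to one over the square root of the iteration count.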

Advances of Adadelta

To address Adagrad's drawbacks, Adadelta implements two new ideas.

- **Accumulate the sum of squared gradients over a restricted time window rather than over all time.** Adagrad, by contrast, accumulates squared gradients over the entire training run, so the sum can grow without bound. Because this sum sits in the denominator of the update and the learning rate in the numerator, the effective learning rate approaches zero as the sum grows. By restricting the potential size of the sum, Adadelta ensures that learning continues to make progress even after a large number of iterations have been completed.
- **Correct the mismatch in units that exists in most gradient descent based algorithms.** With past algorithms (e.g. Adagrad, SGD, or Momentum), the units of each parameter update relate to the gradient rather than to the parameter, so the units of the update don't match the units of the parameter being updated. To address this drawback, Adadelta uses a Hessian approximation that is always positive in place of the learning rate, ensuring that the update direction follows the negative gradient at each step, as in SGD. This eliminates the learning rate from the update rule completely, so there is no longer a requirement to set it manually.
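The two ideas can be combined in a short sketch for a single parameter. It assumes the usual formulation in which the restricted window is approximated by an exponentially decaying average, and the step is the ratio of the RMS of past updates to the RMS of past gradients. The function name `adadelta_path`, the toy objective, and the values `rho=0.95` and `eps=1e-6` are assumptions chosen for illustration.

```python
import math


def adadelta_path(grad_fn, x0, rho=0.95, eps=1e-6, steps=200):
    """Run an Adadelta-style update on one parameter and return every iterate."""
    x = x0
    avg_sq_grad = 0.0    # decaying average of squared gradients
    avg_sq_update = 0.0  # decaying average of squared updates
    path = [x]
    for _ in range(steps):
        g = grad_fn(x)
        # Idea 1: a decaying average stands in for a restricted window.
        avg_sq_grad = rho * avg_sq_grad + (1 - rho) * g * g
        # Idea 2: RMS(update) / RMS(gradient) replaces the learning rate,
        # so the step carries the units of the parameter.
        dx = -math.sqrt(avg_sq_update + eps) / math.sqrt(avg_sq_grad + eps) * g
        avg_sq_update = rho * avg_sq_update + (1 - rho) * dx * dx
        x += dx
        path.append(x)
    return path


# Toy objective f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
path = adadelta_path(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(path[1] - path[0], path[-1])
```

No learning rate appears anywhere in the update. Because the average of squared updates starts at zero, the first steps are small (on the order of the square root of `eps`), so a toy problem like this one moves toward the minimum slowly at first.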

Table

| Name | Role | Related Golden topics |
| --- | --- | --- |
| Matthew D. Zeiler | Creator | |

Table

| Title | Author | Link | Type |
| --- | --- | --- | --- |
| An overview of gradient descent optimization algorithms | Sebastian Ruder | | Web |


Text is available under the Creative Commons Attribution-ShareAlike 4.0; additional terms apply. By using this site, you agree to our Terms & Conditions.