Stochastic Gradient Descent (SGD) is a simple gradient-based optimization algorithm used in machine learning and deep learning for training artificial neural networks.

The point of a gradient descent optimization algorithm is to minimize a given cost function, such as the loss function in training an artificial neural network.

A neural network improves during training through small adjustments (i.e. tuning or calibrating) to its weights and biases. A weight expresses how important the corresponding input is to the neuron's output, and the bias is equivalent to a negative threshold value: it sets how large the weighted sum must be before the neuron activates. In the simplest model, a neuron with a step activation, each neuron calculates a weighted sum of its inputs and then applies a pre-set activation function that checks whether the weighted sum reaches the threshold value. There are two possible outcomes from this process:

- The weighted sum is greater than or equal to the threshold value: the neuron fires (i.e. activates).
- The weighted sum is less than the threshold value: the neuron doesn't activate.
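
The two outcomes above can be sketched in a few lines of Python. This is a minimal illustration that assumes a step activation; the function name `neuron_fires` and the example weights are hypothetical, not taken from any particular library.

```python
def neuron_fires(inputs, weights, bias):
    """Return True if the neuron activates under a step activation.

    The bias plays the role of a negative threshold:
    weighted_sum >= threshold  is the same test as  weighted_sum + bias >= 0
    when bias = -threshold.
    """
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return weighted_sum + bias >= 0


# Illustrative values: threshold 0.5, so bias = -0.5.
fires_a = neuron_fires([1, 0], [0.6, 0.4], bias=-0.5)  # weighted sum 0.6
fires_b = neuron_fires([0, 1], [0.6, 0.4], bias=-0.5)  # weighted sum 0.4
```

Here the first input pattern pushes the weighted sum above the threshold and the second does not, so only the first call reports an activation.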

The loss function measures how well a neural network performs a certain task by producing a value that represents how far the actual output is from the desired output. Gradient descent calculates the gradient (slope) of the loss function with respect to the weights and biases, then shifts them against that slope in order to lower the loss in the next iteration. Loss is minimized by repeatedly taking steps in the opposite direction of the gradient until the algorithm converges on a local minimum. This process of minimizing loss is how neural networks learn to better perform specific tasks through training.
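
As a rough sketch of this update loop, the example below runs plain gradient descent on a one-parameter toy loss. The loss `(w - 3)^2`, the learning rate, and the step count are all assumptions chosen for illustration; a real network has many parameters and a far less regular loss surface.

```python
def loss(w):
    # Toy loss with a single minimum at w = 3.
    return (w - 3.0) ** 2


def gradient(w):
    # Derivative of the toy loss with respect to w.
    return 2.0 * (w - 3.0)


def gradient_descent(w, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to lower the loss."""
    for _ in range(steps):
        w -= learning_rate * gradient(w)
    return w


w_final = gradient_descent(w=0.0)
```

Because this toy loss is convex, the local minimum the loop converges to is also the global one; for neural networks that is generally not guaranteed.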

The term *stochastic* describes something that has a random probability distribution or pattern which can be analyzed statistically but not predicted precisely. In the strict sense, a gradient descent algorithm is stochastic when its batch size is equal to 1 and the one example comprising each batch is chosen at random. When the batch size is greater than one but smaller than the full training set, the algorithm is called mini-batch gradient descent.
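
The difference between the two variants comes down to the batch size, which the following sketch makes explicit. It fits a line `y = w*x + b` to noiseless synthetic data with a squared-error loss; the function name `minibatch_sgd`, the data, and the hyperparameters are assumptions for illustration. Randomness is introduced here by shuffling the data once per epoch, which is one common way to pick examples at random.

```python
import random


def minibatch_sgd(data, batch_size=1, learning_rate=0.1, epochs=200, seed=0):
    """Fit y = w*x + b by gradient steps on randomly ordered batches.

    batch_size=1 corresponds to stochastic gradient descent;
    batch_size>1 corresponds to mini-batch gradient descent.
    """
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # random order of examples each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Average gradient of the squared error over the batch.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Synthetic data from the line y = 2x + 1 (no noise).
data = [(i / 10, 2 * (i / 10) + 1) for i in range(-10, 11)]

w_sgd, b_sgd = minibatch_sgd(data, batch_size=1)  # stochastic
w_mb, b_mb = minibatch_sgd(data, batch_size=7)    # mini-batch
```

On this simple, noise-free problem both settings should recover values close to `w = 2` and `b = 1`; on real data, a batch size of 1 gives noisier updates, while larger batches give smoother but more expensive ones.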

## People

- Herbert Robbins (creator)
- Sutton Monro (creator)

## Further reading

- A Bayesian Perspective on Generalization and Stochastic Gradient Descent. Samuel L. Smith, Quoc V. Le. Academic paper.
- Fully Distributed and Asynchronized Stochastic Gradient Descent for Networked Systems. Ying Zhang. Academic paper.
- How do we 'train' neural networks? Vitaly Bushaev. Towards Data Science. Web.
- Stochastic gradient descent algorithms for strongly convex functions at O(1/T) convergence rates. Shenghuo Zhu. Academic paper.
- Stochastic Gradient Descent with momentum. Vitaly Bushaev. Towards Data Science. Web.
- The Stochastic Gradient Descent for the Primal L1-SVM Optimization Revisited. Constantinos Panagiotakopoulos, Petroula Tsampouka. Academic paper.