The point of a gradient descent optimization algorithm is to minimize a given cost function, such as the loss function in training an artificial neural network.
A neural network improves during training through small adjustments (i.e. tuning or calibrating) to its weights and biases. A weight is a value that expresses how important the corresponding input is to the output, and a bias is equivalent to the negative of the threshold value that determines whether a neuron activates. Each neuron in the network calculates a weighted sum of its inputs and then applies a pre-set activation function. In the simplest case, a step function, the activation function checks whether or not the weighted sum reaches the threshold value. There are two possible outcomes from this process:
- The weighted sum is greater than or equal to the threshold value: the neuron fires (i.e. activates).
- The weighted sum is less than the threshold value: the neuron doesn't activate.
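The two outcomes above can be sketched as a small threshold neuron. This is a minimal illustration, assuming a step activation; the function names and the input, weight, and threshold values are hypothetical and chosen only to show that a bias equal to the negative threshold gives the same decision.

```python
def neuron_fires(inputs, weights, threshold):
    """Return True if the weighted sum of the inputs reaches the threshold."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return weighted_sum >= threshold


def neuron_fires_with_bias(inputs, weights, bias):
    """Same decision, with the threshold moved into a bias term."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return weighted_sum + bias >= 0


# Hypothetical values for illustration only.
weights = [0.5, 0.75, 0.25]
threshold = 0.5

strong_input = [1.0, 0.0, 1.0]  # weighted sum 0.75, at or above the threshold
weak_input = [0.0, 0.0, 1.0]    # weighted sum 0.25, below the threshold

strong_fires = neuron_fires(strong_input, weights, threshold)
weak_fires = neuron_fires(weak_input, weights, threshold)

# Using bias = -threshold reproduces the same two decisions.
strong_fires_bias = neuron_fires_with_bias(strong_input, weights, -threshold)
weak_fires_bias = neuron_fires_with_bias(weak_input, weights, -threshold)
```

Most networks trained with gradient descent use smooth activation functions rather than a hard step, because a step function has zero slope almost everywhere; the step version is shown here only because it matches the fire / don't fire description.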
The loss function measures how well a neural network performs a certain task by producing a value that represents how far the actual output was from the desired output. Gradient descent calculates the slope (gradient) of the loss function with respect to the weights and biases, then shifts the weights and biases against that slope in order to lower the loss in the next iteration. Loss is minimized by repeatedly taking steps in the opposite direction of the gradient until the algorithm converges on a local minimum. This process of minimizing loss is how neural networks learn to better perform specific tasks through training.
The term stochastic describes something that has a random probability distribution or pattern which can be analyzed statistically but not predicted precisely. In the strict sense, a gradient descent algorithm is considered stochastic when it has a batch size equal to 1 and the one example comprising each batch is chosen at random. When the batch size is greater than one but smaller than the full training set, the algorithm is called mini-batch gradient descent.
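The distinction can be illustrated with one routine whose only difference between the two cases is the batch size. This is a hedged sketch: the function name `sgd`, the linear model, the toy data, and the default learning rate, step count, and seed are all hypothetical choices for illustration.

```python
import random


def sgd(xs, ys, batch_size, learning_rate=0.02, steps=5000, seed=0):
    """Fit y = w * x + b with gradient steps on randomly drawn batches.

    batch_size == 1 is stochastic gradient descent in the strict sense;
    batch_size > 1 is mini-batch gradient descent.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = range(len(xs))
    for _ in range(steps):
        # Choose the examples for this step at random.
        batch = rng.sample(indices, batch_size)
        errors = [(w * xs[i] + b - ys[i], xs[i]) for i in batch]
        grad_w = sum(2 * err * x for err, x in errors) / batch_size
        grad_b = sum(2 * err for err, _ in errors) / batch_size
        # Step against the gradient estimated from this batch only.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (an assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w_sgd, b_sgd = sgd(xs, ys, batch_size=1)    # one random example per step
w_mini, b_mini = sgd(xs, ys, batch_size=2)  # two random examples per step
```

Because each step sees only part of the data, the gradient is a noisy estimate of the full-batch gradient, which is where the randomness in the name comes from. On this noise-free toy data both variants still approach the same weight and bias.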