Other attributes
The unstable gradient problem is a fundamental problem in neural network training, where the gradients in the early layers of a deep neural network tend to either explode or vanish. Neural networks are trained with the gradient descent optimization algorithm: the network produces predictions from the training data, and a loss function measures the prediction error on each iteration. The aim of training is to reduce this loss/prediction error by adjusting the network's parameters iteratively. The gradient descent algorithm does this using a forward and a backward step:
- Forward propagation—An input vector moves forward through the network; each neuron in the next layer is computed by applying an activation function to a weighted sum of the previous layer's outputs plus a bias. The computation iterates layer by layer until the network produces an output, or prediction, and a loss function then measures the difference between that prediction and the target (see the sketch after this list).
- Backward propagation—After this initial evaluation, backward propagation (backpropagation) is used to adjust the weights and biases of each neuron in each layer. The gradients (the derivatives of the loss function with respect to the weights and biases) are calculated, and a gradient descent step is then applied to each parameter to reduce the loss.
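The following is a minimal sketch of one forward and backward pass for a small fully connected network, written in NumPy with a sigmoid activation. The layer sizes, learning rate, squared-error loss, and input/target values are illustrative assumptions, not details taken from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative two-layer network (assumed sizes): 3 inputs -> 4 hidden -> 1 output
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))

x = rng.normal(size=(3, 1))   # one training example (assumed)
y = np.array([[1.0]])         # its target value (assumed)
lr = 0.1                      # learning rate (assumed)

# Forward propagation: weighted sum plus bias, then activation, layer by layer
z1 = W1 @ x + b1;  a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)
loss = 0.5 * np.sum((a2 - y) ** 2)   # squared-error loss between prediction and target

# Backward propagation: derivatives of the loss with respect to weights and biases
d2 = (a2 - y) * a2 * (1 - a2)        # error term at the output layer
d1 = (W2.T @ d2) * a1 * (1 - a1)     # error term propagated back to the hidden layer

# Gradient descent step: move each parameter against its gradient
W2 -= lr * d2 @ a1.T;  b2 -= lr * d2
W1 -= lr * d1 @ x.T;   b1 -= lr * d1
```

Repeating these two steps over many iterations is what gradually reduces the loss.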
Issues can occur when applying the gradient descent algorithm: the derivative terms can become extremely small (approaching zero) or extremely large. These issues, referred to as the vanishing and exploding gradient problems, respectively, prevent the model's performance from improving during training.
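As a rough numeric illustration (the per-layer factors 0.25 and 1.75 and the depth of 30 are arbitrary assumptions), multiplying such derivative terms together across many layers quickly drives the overall gradient toward zero or toward a very large value:

```python
# Assumed per-layer derivative magnitudes: 0.25 (small) and 1.75 (large)
layers = 30
print(0.25 ** layers)   # ~8.7e-19: the gradient effectively vanishes
print(1.75 ** layers)   # ~1.9e+07: the gradient explodes
```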
The unstable gradient problem is not the vanishing gradient problem or the exploding gradient problem as such; it arises because the gradient in an early layer is the product of terms from all succeeding layers. The more layers a network has, the more intrinsically unstable this product becomes. The only way every layer can learn at close to the same speed, and so avoid vanishing or exploding gradients, is for all these products of terms to balance out, and the chance of that happening becomes smaller as layers are added. Unless a neural network is given some mechanism for balancing learning speeds, its layers will therefore learn at different speeds.
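A sketch of this effect under assumed conditions (a 10-layer chain of single-neuron sigmoid layers with standard-normal weights and an input of 1.0): because the gradient at an early layer is the product of one term per succeeding layer, its magnitude typically differs sharply from that of later layers.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

depth = 10                         # assumed number of layers
w = rng.normal(size=depth)         # one weight per layer (width-1 layers for simplicity)

# Forward pass: a chain of scalar sigmoid layers
a = [1.0]                          # input value (assumed)
for l in range(depth):
    a.append(sigmoid(w[l] * a[-1]))

# Backward pass: the gradient at layer l is the product of the terms
# contributed by every succeeding layer (sigmoid derivative times weight)
grad = 1.0
for l in reversed(range(depth)):
    s = a[l + 1]                   # sigmoid output of layer l
    grad *= s * (1 - s) * w[l]     # multiply in this layer's term
    print(f"layer {l}: |d output / d a[{l}]| = {abs(grad):.2e}")
```

With sigmoid activations each term's magnitude is at most 0.25 times the weight, so the printed gradient magnitudes shrink rapidly toward the early layers; with large weights the same product can instead grow.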
When the magnitudes of gradients accumulate across layers, an unstable network is more likely to result, which causes poor prediction results.

