The vanishing gradient problem can occur in artificial neural networks trained with gradient descent and backpropagation. During training, the gradient of the loss function with respect to each weight is used to adjust that weight on every iteration. The problem occurs when the gradient becomes so small that the weights effectively stop updating, which prevents the network from learning.
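As a minimal sketch of the update described above, assuming plain gradient descent on a single scalar weight (the function name `sgd_step`, the learning rate, and the gradient values are illustrative assumptions, not taken from any particular network):

```python
def sgd_step(weight, grad, learning_rate=0.1):
    """One plain gradient-descent update: w <- w - learning_rate * grad."""
    return weight - learning_rate * grad


w = 0.5

# A gradient of ordinary size moves the weight by a noticeable amount.
healthy = sgd_step(w, grad=0.2)

# A vanished gradient leaves the weight almost exactly where it was.
vanished = sgd_step(w, grad=1e-10)

print("change with healthy gradient: ", abs(healthy - w))
print("change with vanished gradient:", abs(vanished - w))
```

With the vanished gradient, the weight changes by about ten orders of magnitude less than the learning rate, so many iterations would pass without any meaningful learning.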
Networks using activation functions that saturate, so that their derivatives are close to zero over much of the input range (such as the sigmoid function, whose derivative never exceeds 0.25), are especially susceptible to the vanishing gradient problem. During backpropagation these small derivatives are multiplied together layer by layer, so the gradient shrinks roughly exponentially with depth and the layers closest to the input adjust their weights very slowly or not at all. Leaky or parametric ReLU activation functions mitigate vanishing gradients because their derivative is 1 for positive inputs and a small nonzero slope for negative inputs, so it is never exactly zero.
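The layer-by-layer multiplication can be illustrated with a small sketch. It is a simplification that tracks only the activation derivatives along one path through the network and ignores the weights; the helper names, the pre-activation values, and the `negative_slope` of 0.01 are assumptions chosen for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid; its maximum is 0.25, reached at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


def leaky_relu_grad(x, negative_slope=0.01):
    """Derivative of leaky ReLU: 1 for positive inputs, a small slope otherwise."""
    return 1.0 if x > 0 else negative_slope


def chained_grad(local_grad, pre_activations):
    """Product of the activation derivatives along one path through the layers."""
    product = 1.0
    for z in pre_activations:
        product *= local_grad(z)
    return product


# Hypothetical pre-activations along a 10-layer path (illustrative values only).
zs = [0.0, 1.5, -2.0, 0.5, 3.0, -1.0, 2.0, 0.0, -0.5, 1.0]

sigmoid_factor = chained_grad(sigmoid_grad, zs)

# The same path with every unit in the positive regime of leaky ReLU.
positive_zs = [abs(z) + 1.0 for z in zs]
leaky_factor = chained_grad(leaky_relu_grad, positive_zs)

print("sigmoid factor over 10 layers:   ", sigmoid_factor)
print("leaky ReLU factor (positive path):", leaky_factor)
```

Because each sigmoid derivative is at most 0.25, the sigmoid factor over ten layers cannot exceed 0.25 to the power 10, which is below one millionth. On the positive path the leaky ReLU factor stays at exactly 1. For negative inputs the leaky ReLU derivative is the small `negative_slope`, which still shrinks the product but never cuts the gradient off entirely.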
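Networks in which the per-layer factors can become very large, for example because of activation functions with large derivatives or because of large weights, are susceptible to the related exploding gradient problem, in which the gradient grows rather than shrinks as it is propagated back through the layers.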
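Both problems come from the same repeated multiplication. The sketch below assumes, purely for illustration, that every layer scales the gradient by the same constant factor; real networks have factors that vary by layer and by input, and the function name `gradient_scale` is hypothetical.

```python
def gradient_scale(per_layer_factor, depth):
    """Overall scaling of a gradient after passing back through `depth` layers,
    under the simplifying assumption of one constant factor per layer."""
    return per_layer_factor ** depth


depth = 50
exploding = gradient_scale(1.5, depth)   # factor above 1: grows with depth
vanishing = gradient_scale(0.25, depth)  # factor below 1: shrinks with depth

print("factor 1.5 over 50 layers: ", exploding)
print("factor 0.25 over 50 layers:", vanishing)
```

A per-layer factor only modestly above 1 already yields a scaling in the hundreds of millions after 50 layers, while a factor of 0.25 drives the gradient far below anything that could produce a useful weight update.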