Problems with RNNs: Vanishing and Exploding Gradients Explained
- Aryan
- Jan 30
- 5 min read
PROBLEMS WITH RNNs
Recurrent Neural Networks (RNNs) are neural network architectures designed specifically for sequential data, and they work well for tasks such as text processing and time-series analysis. In practice, however, RNNs are rarely used today because they suffer from two major problems, and other architectures are often preferred for sequential data.
The first problem is long-term dependency, and the second problem is stagnated training (often linked to unstable gradients). Both issues arise from the behavior of gradients during training. Let us first focus on the long-term dependency problem.
In sequential data, later parts of a sequence often depend on earlier parts. When a sequence becomes long, that is, when the number of time steps grows, an RNN struggles to retain information from the earliest time steps, and important early details are gradually lost. This inability to remember distant information is known as the long-term dependency problem.
Consider a next-word prediction task. In the sentence “Marathi is spoken in Maharashtra”, the word “Maharashtra” depends directly on “Marathi”. The sentence is short, so the dependency is short-term and RNNs handle it well. Now consider a longer example: “Maharashtra is a beautiful place. I went there last year, but I could not enjoy it properly because I don’t understand Marathi.” Here, the word “Marathi” depends on “Maharashtra”, but many words appear in between, creating a long-term dependency across many time steps. In such cases, RNNs often fail to capture the relationship. This is a classic example of the long-term dependency problem: the model does not have enough memory to retain information from far earlier in the sequence.
This problem mainly arises from the vanishing gradient issue during training. As gradients are propagated back through many time steps, they become very small, causing the influence of early inputs to disappear.
The second major issue is stagnated training. Sometimes, RNNs do not train properly: the loss does not improve, and the model fails to converge to good results. One important reason for this behavior is the exploding gradient problem, where gradients become excessively large, making training unstable and preventing effective learning.
These two problems—long-term dependency and stagnated training—are the key reasons why standard RNNs are limited and why more advanced architectures are often preferred in practice.
PROBLEM 1 – LONG-TERM DEPENDENCY
Suppose we have a dataset with input–output pairs as shown below:
| Input | Output |
|-------|--------|
| 101   | 1      |
| 001   | 0      |
| 000   | 0      |
| 111   | 1      |
Here, each digit acts as the input at a particular time step. Since each input has three digits, we have three time steps. For this task, we construct a simple RNN with a single hidden layer containing one neuron.
[Figure: RNN architecture with a single hidden neuron]
The model has three sets of weights: input-to-hidden weight wᵢₙ, hidden-to-hidden (recurrent) weight wₕ, and hidden-to-output weight wₒᵤₜ.
When we train this RNN on the dataset, the network unfolds across time according to the number of time steps.
[Figure: the RNN unrolled across three time steps]
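To make the unrolled computation concrete, here is a minimal numpy sketch of this one-neuron RNN applied to one row of the dataset. The weight values and the sigmoid output head are assumptions for illustration; the post does not specify them.

```python
import numpy as np

def rnn_forward(x, w_in, w_h, w_out):
    """Unroll a one-neuron RNN over the digits of one input sequence."""
    O = 0.0                                  # initial hidden state O_0
    for x_t in x:
        O = np.tanh(x_t * w_in + O * w_h)    # O_t = tanh(x_t*w_in + O_(t-1)*w_h)
    # Sigmoid output head for the binary label (an assumption for this sketch).
    return 1.0 / (1.0 + np.exp(-O * w_out))

# First row of the toy dataset: input "101", expected output 1.
print(rnn_forward([1, 0, 1], w_in=0.8, w_h=0.5, w_out=1.2))
```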
During training, we use backpropagation through time (BPTT) to find the values of wᵢₙ, wₕ, and wₒᵤₜ that minimize the loss using gradient descent. The weight update equations are:

wᵢₙ = wᵢₙ − η (∂L/∂wᵢₙ)
wₕ = wₕ − η (∂L/∂wₕ)
wₒᵤₜ = wₒᵤₜ − η (∂L/∂wₒᵤₜ)

where η is the learning rate.
We start with initial values for the weights, compute the forward pass, calculate the loss, find the derivatives, and then update the weights over multiple epochs. The challenging part is the derivative calculation. The loss does not depend on a single weight directly; instead, it depends on multiple outputs across different time steps.
For example, with three time steps the gradient of the loss with respect to wᵢₙ can be written as:

∂L/∂wᵢₙ = (∂L/∂O₃)(∂O₃/∂wᵢₙ) + (∂L/∂O₃)(∂O₃/∂O₂)(∂O₂/∂wᵢₙ) + (∂L/∂O₃)(∂O₃/∂O₂)(∂O₂/∂O₁)(∂O₁/∂wᵢₙ)

(here ∂Oₜ/∂wᵢₙ denotes only the direct effect of wᵢₙ at time step t)
This expression shows how a change in wᵢₙ affects the loss through different time steps. The first term corresponds to the current time step, which is a short-term dependency. The second and third terms correspond to earlier time steps, with the third term representing the long-term dependency. Since xᵢ₁ is far from the final output, its contribution to the loss is much weaker than that of the latest input xᵢ₃, which is closer and therefore has a stronger effect.
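As a quick numeric check, here is a sketch that evaluates these three terms for the last dataset row (input "111"). The weight values are made up, and the common factor ∂L/∂O₃ is omitted since it multiplies all three terms equally:

```python
import numpy as np

# Made-up weights; input "111" from the toy dataset (one digit per time step).
w_in, w_h = 0.5, 0.5
x = [1.0, 1.0, 1.0]

# Forward pass: O_t = tanh(x_t*w_in + O_(t-1)*w_h), starting from O_0 = 0.
O = [0.0]
for x_t in x:
    O.append(np.tanh(x_t * w_in + O[-1] * w_h))

# tanh'(z) = 1 - tanh(z)^2, so the local derivatives at each time step are:
d_direct = [(1 - O[t] ** 2) * x[t - 1] for t in (1, 2, 3)]  # dO_t/dw_in (direct)
d_recur  = [(1 - O[t] ** 2) * w_h for t in (1, 2, 3)]       # dO_t/dO_(t-1)

term3 = d_direct[2]                            # short-term path, via x_3
term2 = d_recur[2] * d_direct[1]               # via x_2
term1 = d_recur[2] * d_recur[1] * d_direct[0]  # long-term path, via x_1
print(term3, term2, term1)   # each extra hop shrinks the contribution
```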
In real-world problems, sequences can be much longer, for example 100 time steps. In that case, the gradient becomes:

∂L/∂wᵢₙ = (∂L/∂O₁₀₀)(∂O₁₀₀/∂wᵢₙ) + (∂L/∂O₁₀₀)(∂O₁₀₀/∂O₉₉)(∂O₉₉/∂wᵢₙ) + … + (∂L/∂O₁₀₀)(∂O₁₀₀/∂O₉₉)(∂O₉₉/∂O₉₈)…(∂O₂/∂O₁)(∂O₁/∂wᵢₙ)
This can be written compactly as:

∂L/∂wᵢₙ = Σₜ₌₁…₁₀₀ (∂L/∂O₁₀₀) · ( ∏ₖ₌ₜ₊₁…₁₀₀ ∂Oₖ/∂Oₖ₋₁ ) · (∂Oₜ/∂wᵢₙ)
Now consider the hidden state equations:
O₁ = tanh(xᵢ₁ wᵢₙ + O₀ wₕ)
Oₜ = tanh(xᵢₜ wᵢₙ + Oₜ₋₁ wₕ)
The derivative of Oₜ with respect to Oₜ₋₁ is:

∂Oₜ/∂Oₜ₋₁ = tanh′(xᵢₜ wᵢₙ + Oₜ₋₁ wₕ) · wₕ = (1 − Oₜ²) · wₕ
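This derivative can be verified numerically. Here is a small sketch comparing the formula against a central finite difference, using arbitrary toy values:

```python
import numpy as np

# Check dO_t/dO_(t-1) = (1 - O_t^2) * w_h numerically (toy values assumed).
w_in, w_h, x_t, O_prev = 0.8, 0.5, 1.0, 0.3

O_t = np.tanh(x_t * w_in + O_prev * w_h)
analytic = (1 - O_t ** 2) * w_h

eps = 1e-6
numeric = (np.tanh(x_t * w_in + (O_prev + eps) * w_h)
           - np.tanh(x_t * w_in + (O_prev - eps) * w_h)) / (2 * eps)

print(analytic, numeric)   # the two values agree
```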
Since the derivative of the tanh function lies between 0 and 1, the product of derivatives along the long-term path becomes:

∏ₖ₌ₜ₊₁…₁₀₀ ∂Oₖ/∂Oₖ₋₁ = ∏ₖ₌ₜ₊₁…₁₀₀ (1 − Oₖ²) · wₕ

where every factor (1 − Oₖ²) is at most 1, so the magnitude of the product is bounded by |wₕ| raised to the number of time steps it spans.
If wₕ is chosen between 0 and 1, then each term inside the product is a small number. Multiplying many such small values across a large number of time steps drives the entire product toward zero. As a result, the contribution of early time steps becomes negligible, and the gradient is dominated by recent inputs. This is exactly the long-term dependency problem: as the sequence length increases, early information has almost no influence on the final output, and short-term dependencies dominate learning.
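A tiny numeric sketch makes the effect visible. The pre-activations below are random stand-ins, and wₕ = 0.5 is an arbitrary choice in (0, 1):

```python
import numpy as np

np.random.seed(0)
w_h = 0.5                                 # recurrent weight chosen in (0, 1)
z = np.random.randn(100)                  # stand-in pre-activations, one per time step
factors = (1 - np.tanh(z) ** 2) * w_h     # each dO_k/dO_(k-1); all below w_h in magnitude
print(np.prod(factors))                   # practically 0: the 100-step gradient path vanishes
```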
SOLUTIONS
- Use different activation functions. With tanh, the derivative lies between 0 and 1, which encourages vanishing gradients. Alternatives such as ReLU or Leaky ReLU do not strictly constrain derivatives to this range.
- Use better weight initialization. Initializing recurrent weights close to 1, or using identity matrices, can help preserve gradients across time (a numeric sketch follows this list).
- Use skip or residual connections to reduce the effective distance between distant time steps.
- Move to more advanced architectures such as LSTM, which are specifically designed to handle long-term dependencies.
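To see why the first two suggestions help, here is a minimal sketch. It assumes ReLU activations whose pre-activations all stay positive, and a recurrent weight initialized at exactly 1; under those assumptions every factor in the gradient product is 1:

```python
import numpy as np

# ReLU's derivative is 1 on positive inputs, so if every pre-activation stays
# positive (an assumption) and w_h is initialized at 1, each dO_k/dO_(k-1) = 1.
w_h = 1.0
relu_grads = np.ones(100)      # activation derivatives across 100 time steps
print(np.prod(relu_grads * w_h))   # 1.0: the gradient path neither vanishes nor explodes
```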
PROBLEM 2 – UNSTABLE TRAINING (EXPLODING GRADIENT)
Now consider the opposite situation, where the gradient contributions flowing back from distant time steps grow so large that they dominate the update, overwhelming the influence of short-term dependencies. This is the exploding gradient problem, and it causes unstable training in RNNs.
Suppose we use the ReLU activation function instead of tanh. Unlike tanh, whose derivative lies between 0 and 1, the derivative of ReLU is exactly 1 for any positive input, so it does not shrink the gradient product. If the recurrent weight is initialized at a value greater than 1, then during backpropagation the factors wₕ get multiplied repeatedly across many time steps, and the gradient can grow very large and eventually blow up.
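A quick numeric sketch, assuming ReLU stays in its positive regime and an arbitrary wₕ = 1.5:

```python
import numpy as np

# ReLU's derivative is 1 on positive inputs, so each gradient factor reduces to w_h.
w_h = 1.5                                 # recurrent weight above 1 (arbitrary choice)
factors = np.full(100, 1.0 * w_h)         # dO_k/dO_(k-1) across 100 time steps
print(np.prod(factors))                   # ~4e17: the gradient explodes
```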
When gradients explode, they dominate the update step, causing abrupt and unstable weight changes. In such cases, training may stop improving altogether, or the loss may fluctuate wildly instead of converging. A high learning rate further amplifies this effect, making the training process even more unstable and preventing the model from learning properly. Because of this risk, standard RNNs often use the tanh activation function by default, as it helps keep gradients under control.
SOLUTIONS
- Gradient clipping: limit the gradient to a maximum value (or norm) during backpropagation. This prevents gradients from growing excessively large and stabilizes training in many situations (a PyTorch sketch follows this list).
- Controlled learning rate: using an appropriate learning rate is crucial. If the learning rate is too high, even moderate gradients can cause training to become unstable.
- Move to LSTM: Long Short-Term Memory networks handle both vanishing and exploding gradients more effectively than standard RNNs.
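For reference, here is a minimal PyTorch sketch of gradient clipping on a small RNN. The model shape, dummy data, and max_norm value are arbitrary choices for illustration; torch.nn.utils.clip_grad_norm_ is the standard utility for this:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny RNN: 1 input feature, 8 hidden units, 1 output.
rnn = nn.RNN(input_size=1, hidden_size=8, batch_first=True)
head = nn.Linear(8, 1)
params = list(rnn.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)

# Dummy batch: 4 sequences of length 100.
x = torch.randn(4, 100, 1)
y = torch.randn(4, 1)

out, _ = rnn(x)               # out: (4, 100, 8)
pred = head(out[:, -1, :])    # predict from the last time step
loss = nn.functional.mse_loss(pred, y)

optimizer.zero_grad()
loss.backward()
# Rescale the global gradient norm to at most 1.0 before the update.
torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
optimizer.step()
```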