
Deep Learning Optimizers Explained: NAG, Adagrad, RMSProp, and Adam

  • Writer: Aryan

WHAT IS NESTEROV ACCELERATED GRADIENT (NAG)?

 

Nesterov Accelerated Gradient (NAG) is designed to reduce oscillations during optimization so that convergence becomes faster and more stable. It addresses a key limitation of classical momentum, where the optimizer continues to oscillate even after reaching close to the optimal value.

With NAG, the optimizer reaches the optimal region quickly and then converges with fewer oscillations. In short, it reduces unnecessary oscillations near the minimum and helps achieve faster and smoother convergence.

 

MATHEMATICAL INTUITION

 

First, let us recall the update equations for momentum-based gradient descent:

vₜ = β vₜ₋₁ + η ∇wₜ

wₜ₊₁ = wₜ − vₜ

Here, β is the momentum (decay) factor and η is the learning rate.

Momentum can be seen as a modification of standard gradient descent. In standard gradient descent, the update rule is:

wₜ₊₁ = wₜ − η ∇wₜ

So the update depends only on the current gradient. In momentum, however, the update comes from two components: the past velocity term β vₜ₋₁ and the current gradient term η ∇wₜ.

If we substitute vₜ into the update equation, we can see that the weight update is influenced by both past gradients (stored in velocity) and the current gradient. These two terms together decide how the optimizer moves toward the minimum. Because momentum accumulates past gradients, it often results in larger jumps, helping the optimizer move faster in the direction of the minimum.
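To make the two-term update concrete, here is a minimal Python sketch of momentum-based gradient descent on a toy one-dimensional loss. The quadratic loss L(w) = (w − 3)² and the values β = 0.9, η = 0.1 are illustrative assumptions, not taken from the post:

def grad(w):
    # Gradient of the toy loss L(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

beta, eta = 0.9, 0.1   # momentum factor and learning rate (assumed values)
w, v = 0.0, 0.0        # initial weight and velocity

for t in range(200):
    v = beta * v + eta * grad(w)   # v_t = beta * v_(t-1) + eta * grad(w_t)
    w = w - v                      # w_(t+1) = w_t - v_t

print(w)  # approaches the minimum at w = 3 after overshooting and oscillating around it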

This is why momentum accelerates convergence but can also cause overshooting and oscillations near the optimum. A small mathematical tweak to this idea leads to Nesterov Accelerated Gradient.

 

NESTEROV ACCELERATED GRADIENT

 

In classical momentum, the update direction is decided using the past velocity β vₜ₋₁ and the gradient computed at the current position ∇wₜ.

In NAG, instead of computing both together, we first apply the momentum step and move ahead, and then compute the gradient at this new position. This “look-ahead” behavior allows NAG to correct its direction earlier.

First, we compute the look-ahead position:

wₗₐ = wₜ − β vₜ₋₁

Then, we compute the velocity using the gradient at this look-ahead point:

vₜ = β vₜ₋₁ + η ∇wₗₐ

Finally, the parameters are updated as:

wₜ₊₁ = wₜ − vₜ

In this way, NAG first anticipates where the momentum term will take the parameters, evaluates the gradient at that position, and then performs the update. Because the gradient is computed after looking ahead, NAG is able to reduce overshooting and oscillations, leading to faster and more stable convergence.
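The look-ahead step translates almost directly into code. Below is a minimal sketch on the same kind of toy quadratic loss as before; the loss and the β, η values are again illustrative assumptions:

def grad(w):
    # Gradient of the toy loss L(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

beta, eta = 0.9, 0.1   # assumed values
w, v = 0.0, 0.0

for t in range(200):
    w_la = w - beta * v              # look-ahead position: w_la = w_t - beta * v_(t-1)
    v = beta * v + eta * grad(w_la)  # gradient is evaluated at the look-ahead point
    w = w - v                        # w_(t+1) = w_t - v_t

print(w)  # converges toward w = 3 with smaller oscillations than classical momentum

The only change from the momentum sketch is that the gradient is computed at w_la rather than at w, which is exactly the "look ahead, then correct" behavior described above.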

 

GEOMETRIC INTUITION

 

MOMENTUM

In momentum-based gradient descent, at every step we compute two quantities: the accumulated momentum (velocity) and the current gradient. At the beginning, when momentum is zero, the update is dominated by the gradient. If the slope at the starting point w₀​ is negative, the parameters move downhill toward the minimum.

As updates continue, momentum starts building up from previous gradients. When the parameters cross the minimum, the gradient direction changes and tries to pull the parameters back toward the minimum. However, the momentum term may still dominate, pushing the parameters further in the same direction. This causes overshooting and oscillations around the minimum.

Over time, as the direction of gradients keeps changing, the momentum term gradually reduces. Eventually, the combined effect of momentum and gradient brings the parameters closer to the minimum, and the oscillations decrease. This is how momentum accelerates convergence while still oscillating around the optimum before settling.

 

NESTEROV ACCELERATED GRADIENT (NAG)

In NAG, the update is not computed using momentum and gradient at the same point. Instead, we first apply the momentum step and move ahead, and then compute the gradient at this new position.

Geometrically, NAG looks ahead to estimate where the momentum term will take the parameters. At this look-ahead point, the gradient is evaluated. If this gradient points in the opposite direction, the optimizer corrects its course early and moves back toward the minimum.

This results in a small corrective turn rather than a large overshoot. The key difference is that NAG separates the momentum step and gradient calculation, allowing it to anticipate future movement. Because of this look-ahead behavior, NAG significantly dampens oscillations and converges more smoothly compared to classical momentum.

 

DISADVANTAGES

Since NAG aggressively dampens oscillations, it can sometimes reduce the ability of the optimizer to escape shallow local minima. By limiting oscillatory behavior too much, the optimizer may settle into a local minimum instead of exploring further. This tendency to get trapped in local minima is considered one of the main drawbacks of Nesterov Accelerated Gradient.

 

Adagrad: The Adaptive Gradient Optimizer

 

Adagrad (Adaptive Gradient) is based on a simple yet powerful idea: we should not rely on a fixed learning rate for all parameters. Instead, the algorithm adapts the learning rate dynamically based on the behavior of each parameter during training.

Rather than applying the same update everywhere, Adagrad adjusts learning rates to fit the specific situation at hand.

 

When Does Adagrad Perform Best?

 

Adagrad performs particularly well in two scenarios where standard optimization algorithms often struggle.

 

1. Features with Different Scales

Consider a dataset containing features like CGPA (scaled from 0–10) and Salary (scaled in thousands). When feature ranges differ significantly, standard Gradient Descent takes a long time to converge and often oscillates inefficiently. Adagrad handles these scale differences effectively by adapting learning rates for each parameter.

 

2. Sparse Data

Adagrad is especially effective for sparse datasets, where most feature values are zero. This sparsity often produces an “elongated bowl” shape in the loss landscape, which is difficult for standard optimizers to navigate efficiently.

 

Understanding the Loss Landscape

[Figure: contour plots of the loss with respect to w and b, showing circular contours (right) for well-scaled features and elongated contours (left) for sparse or poorly scaled features]

The visualizations illustrate contour plots of the loss function with respect to parameters w and b.

  • Circular Plot (Right): This represents well-normalized features. The contours are circular, allowing an optimizer to move directly toward the minimum with minimal oscillation.

  • Elongated Plot (Left): This occurs when features are sparse or poorly scaled. The contours stretch into an elongated ellipse. In such landscapes, the slope changes rapidly along one axis (steep direction) while remaining relatively flat along the other.

 

The Solution

In elongated valleys, standard optimization algorithms tend to oscillate along the steep direction or make very slow progress along the flat direction. Adagrad addresses this problem by adjusting the update step for each parameter individually.

By shrinking the learning rate for frequently updated parameters and maintaining larger updates for infrequent ones, Adagrad navigates these difficult landscapes more smoothly and converges more reliably.

 

PROBLEMS

[Figure: optimization paths of batch gradient descent and momentum on an elongated loss surface]

Batch Gradient Descent

In batch gradient descent, the parameter update initially moves downhill toward the minimum. However, when the loss surface is elongated, the optimizer starts progressing extremely slowly along the stretched axis. This inefficiency arises because the curvature of the loss function differs significantly across directions, causing uneven progress during optimization.

 

Momentum

Momentum helps accelerate training and reduces oscillations, but in elongated loss landscapes the core issue still remains.

  • Along the b-axis, movement is relatively fast.

  • Along the w-axis, movement becomes very slow because the feature associated with this parameter is sparse.

This imbalance makes the path to the minimum inefficient. Although momentum smooths updates, when one feature is dense and another is sparse, it still struggles to move optimally toward the minimum.

 

Why Batch Gradient Descent and Momentum Struggle with Sparse Features

 

Both batch gradient descent and momentum rely heavily on gradient magnitudes.

When a feature is mostly zeros (sparse), its gradient contribution becomes very small. In contrast, features with continuous or non-sparse values generate stronger gradients.

  • Dense features produce large gradient updates, leading to faster movement.

  • Sparse features often produce near-zero gradients, resulting in very small updates.

As a result, the optimization path becomes:

  • Normal or fast along dense features

  • Very slow along sparse features

This mismatch creates an elongated contour shape in the loss landscape and significantly slows down convergence.

 

Why Movement Differs: A Simple Intuition

 

Consider a very small neural network with a single neuron.

We take inputs x and target y, with weight w and bias b, and generate predictions.

During training, batch gradient descent updates the parameters as:

wₜ₊₁ = wₜ − η ∂L/∂w

bₜ₊₁ = bₜ − η ∂L/∂b

For simplicity, assume a linear activation. The gradients then become:

∂L/∂w = −2(y − ŷ) x

∂L/∂b = −2(y − ŷ)

If we have 100 training samples:

  • For each row, we compute the gradient

  • Multiply it by the learning rate

  • Then average or sum the updates depending on the batch setting

Now comes the key insight.

If the input feature x is sparse, many values of x are zero. For those rows, the gradient contribution vanishes:

−2(y − ŷ)x = 0

Averaged over the batch, this makes the update extremely small, which means movement in the w-direction is tiny.

In contrast, the bias term b is not multiplied by any input feature. Its gradient:

−2(y − ŷ)

is usually non-zero, leading to larger updates and faster movement.


This explains the geometric behavior:

  • Sparse feature → slow movement → elongated loss contours

  • Dense feature → normal movement → circular loss contours

During early training, the optimizer moves steeply downward along the dense direction, but then slowly drifts along the elongated direction toward the minimum. Because the derivative for the sparse feature becomes zero repeatedly, progress in that direction remains extremely slow.
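This imbalance can be checked numerically. The sketch below builds a toy batch in which x is mostly zeros and compares the batch gradients for w and b; the data, sparsity level, and squared-error loss are all assumptions made only for illustration:

import numpy as np

rng = np.random.default_rng(0)

# Toy batch of 100 samples where roughly 95% of the x values are zero (sparse feature).
x = rng.normal(size=100) * (rng.random(100) < 0.05)
y = 2.0 * x + 1.0                          # assumed "true" relationship for the toy data

w, b = 0.0, 0.0
y_hat = w * x + b                          # single neuron with a linear activation

# Batch gradients of the squared error, averaged over the batch:
grad_w = np.mean(-2.0 * (y - y_hat) * x)   # most terms are zero, so the average is small
grad_b = np.mean(-2.0 * (y - y_hat))       # every sample contributes, so this stays large

print(abs(grad_w), abs(grad_b))            # |grad_w| comes out much smaller than |grad_b|

With the same learning rate, w therefore moves far more slowly than b, which is exactly the elongated-contour behavior described above.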

 

Mathematical Intuition: How Adagrad Adapts

We have seen that sparse features lead to small updates (slow movement), while dense features lead to large updates (fast movement). To correct this imbalance, the optimizer must normalize these movements so that all parameters progress at comparable speeds.

 

The Core Idea

Since we cannot directly change the gradient (it is determined by the data), the only lever we can control is the learning rate η.

  • Standard optimizers use a single, fixed learning rate for all parameters.

  • Adagrad assigns a separate, adaptive learning rate to each parameter.

This adaptation follows a simple rule:

  • If a parameter has a large gradient (steep slope or frequent updates), Adagrad reduces its effective learning rate to avoid overshooting.

  • If a parameter has a small gradient (flat slope or sparse updates), Adagrad maintains or relatively increases its effective learning rate so that learning continues.

 

The Algorithm and Update Rule

For a parameter w at time step t, the update rule is:

wₜ₊₁ = wₜ − (η / √(vₜ + ε)) gₜ

Where:

  • gₜ is the gradient at time step t (∂L/∂w).

  • η is the initial global learning rate.

  • vₜ​ is the accumulated sum of squared past gradients, which acts as the algorithm’s “memory”:

vₜ = vₜ₋₁ + (gₜ)² = ∑ᵢ₌₁ᵗ (gᵢ)²
  • ε is a small constant (e.g., 10⁻⁸) added to prevent division by zero.


Why Do We Square the Gradients?

 

  1. Magnitude over Direction: We care about how large the gradient is, not whether it is positive or negative. Squaring ensures all contributions are positive.

  2. Differentiability: Unlike the absolute value |g|, the square function g² is smooth and differentiable everywhere, which is important for gradient-based optimization.

 

How It Works in Practice

  • Large gradients: When gradients are consistently large, vₜ grows quickly. Dividing η by a large value reduces the effective learning rate, leading to smaller, safer updates.

  • Small gradients: When gradients are small (as in sparse features), vₜ​ remains small. Dividing by a small value keeps the effective learning rate relatively large, allowing meaningful progress.
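Putting the update rule together, here is a minimal NumPy sketch of Adagrad on a toy two-parameter problem in which the two gradients differ in scale by a factor of 100. The loss, the gradient scales, and the hyperparameter values are illustrative assumptions:

import numpy as np

eta, eps = 0.1, 1e-8                 # global learning rate and epsilon (assumed values)
w = np.zeros(2)                      # two parameters, e.g. a weight and a bias
v = np.zeros(2)                      # accumulated sum of squared gradients, one per parameter

def grad(w):
    # Toy gradients: the first parameter sees gradients 100x smaller than the second.
    return np.array([0.01, 1.0]) * (w - 1.0)

for t in range(500):
    g = grad(w)
    v += g ** 2                          # v_t = v_(t-1) + g_t^2 (only ever grows)
    w -= eta * g / np.sqrt(v + eps)      # per-parameter step of size eta / sqrt(v_t + eps)

print(w)  # both parameters approach 1 at a similar pace despite very different gradient scales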

 

Disadvantages: The Vanishing Learning Rate

 

While Adagrad handles sparse data effectively, it introduces a critical limitation.

Because vₜ = ∑ (gᵢ)² is a sum of squared values, it can only increase over time; it never decreases. As training progresses and more updates are accumulated, vₜ​ can become very large.

η / √(vₜ + ε) → 0 as vₜ keeps growing

The Consequence

Eventually, the effective learning rate becomes so small that parameter updates nearly stop. The optimizer may converge prematurely and fail to reach the global minimum. This drawback motivated the development of improved optimizers such as RMSProp and Adam, which address this vanishing learning rate issue.
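A tiny numerical sketch makes the effect visible. Assuming, purely for illustration, a roughly constant gradient magnitude of 0.5, the effective learning rate η / √(vₜ + ε) shrinks like 1/√t:

import numpy as np

eta, eps = 0.1, 1e-8
v, g = 0.0, 0.5                           # assumed constant gradient magnitude

for t in range(1, 10001):
    v += g ** 2                           # v_t only ever grows
    if t in (1, 100, 10000):
        print(t, eta / np.sqrt(v + eps))  # prints roughly 0.2, 0.02, 0.002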

 

RMSProp

 

RMSProp stands for Root Mean Square Propagation. It is an optimization algorithm introduced as an improvement over Adagrad.

Adagrad works well when we have sparse data, meaning some features contain many zero values. In such cases, the loss surface becomes elongated, making optimization difficult for Batch Gradient Descent and Momentum, which take a long time to reach the minimum. Adagrad addresses this problem by adapting the learning rate for each parameter.

However, Adagrad has a major drawback. Since it keeps accumulating squared gradients over the entire training history, the effective learning rate keeps decreasing. After some point, the learning rate becomes so small that updates are almost negligible, and the algorithm effectively gets stuck before proper convergence.

RMSProp solves this exact problem.

 

MATHEMATICAL FORMULATION

 

Adagrad

In Adagrad, the accumulated gradient term is computed as:

vₜ = vₜ₋₁ + (∇wₜ)²

Because this term keeps growing without bound, the learning rate keeps shrinking. With sparse features, this can result in extremely small updates, causing the optimizer to stall.

 

RMSProp

RMSProp makes a small but crucial modification. Instead of accumulating the full history of squared gradients, it uses an exponentially weighted moving average:

vₜ = β vₜ₋₁ + (1 − β)(∇ wₜ)²

The parameter update rule remains:

wₜ₊₁ = wₜ − (η / √(vₜ + ε)) ∇wₜ

Here, β is typically set around 0.9 or 0.95.

Unlike Adagrad, RMSProp does not give equal importance to all past gradients. Recent gradients receive more weight, while older gradients gradually decay. Because of this exponential weighting, vₜ​ does not grow uncontrollably, and the learning rate does not vanish too quickly. As a result, updates continue to happen and the optimizer does not get stuck.
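As a minimal sketch of this update, again on a toy quadratic loss (the loss and the hyperparameter values, β = 0.9 and a small global learning rate, are assumptions):

def grad(w):
    # Gradient of the toy loss L(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

eta, beta, eps = 0.01, 0.9, 1e-8
w, v = 0.0, 0.0

for t in range(2000):
    g = grad(w)
    v = beta * v + (1 - beta) * g ** 2   # exponentially weighted average of squared gradients
    w -= eta * g / (v + eps) ** 0.5      # effective step eta / sqrt(v_t + eps) stays meaningful

print(w)  # settles very close to the minimum at w = 3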

 

Why RMSProp Does Not Get Stuck (Intuition)

Let us assume β = 0.95 and v₀ ​= 0 :

v₁ = 0.95 × 0 + 0.05(∇w₁)²

v₂ = 0.95 × 0.05(∇w₁)² + 0.05(∇w₂)²

v₃ = 0.95² × 0.05(∇w₁)² + 0.95 × 0.05(∇w₂)² + 0.05(∇w₃)²

v₄ = 0.95[v₃] + 0.05(∇w₄)²

As training progresses, older gradients contribute less due to repeated multiplication by β, while recent gradients dominate. This prevents vₜ​ from inflating indefinitely. Since vₜ remains bounded, the effective learning rate stays meaningful, allowing continued learning without stagnation.
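The contrast with Adagrad can be seen with a constant gradient, an assumption used here only to make the comparison stark: Adagrad's accumulator grows linearly with the number of steps, while RMSProp's moving average settles at the squared gradient magnitude.

beta, g = 0.95, 1.0                       # assumed values for illustration

v_adagrad, v_rmsprop = 0.0, 0.0
for t in range(1000):
    v_adagrad += g ** 2                                 # grows linearly with the number of steps
    v_rmsprop = beta * v_rmsprop + (1 - beta) * g ** 2  # approaches g^2 = 1 and stays there

print(v_adagrad, v_rmsprop)  # roughly 1000.0 and 1.0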

 

DISADVANTAGES

RMSProp does not have a strict theoretical convergence guarantee, and its performance depends on the choice of hyperparameters such as η and β. However, empirically, it has been shown to work very well across a wide range of neural network architectures and tasks.

Because of its stability and effectiveness, RMSProp became one of the most widely used optimizers and directly influenced the development of more advanced methods such as Adam.

 

ADAM

 

Adam stands for Adaptive Moment Estimation. It is one of the most widely used optimization algorithms in deep learning and is commonly applied in ANN, CNN, and RNN models. In practice, Adam is often the default choice because it performs well across a wide range of problems.

Adam combines the ideas of momentum and adaptive learning rates, taking inspiration from both Momentum and Adagrad/RMSProp.

 

MATHEMATICAL INTUITION

The parameter update rule in Adam is:

wₜ₊₁ = wₜ − (η / (√v̂ₜ + ε)) m̂ₜ

Where the first- and second-moment estimates are computed as:

mₜ ​= β₁​mₜ₋₁​+(1−β₁​)∇wₜ​

vₜ ​= β₂​vₜ₋₁​+(1−β₂​)(∇wₜ​)²

Here:

  • mₜ represents the first moment (mean) of the gradients and comes from the idea of momentum.

  • vₜ represents the second moment (uncentered variance) of the gradients and comes from Adagrad/RMSProp-style adaptive learning rates.

 

Bias Correction

Since both mₜ  and vₜ are initialized to zero, they are biased toward zero during the early training steps. To correct this initialization bias, Adam applies bias correction:

m̂ₜ = mₜ / (1 − β₁ᵗ)

v̂ₜ = vₜ / (1 − β₂ᵗ)

Here, t denotes the time step (or iteration).

Typical values are β₁ = 0.9 and β₂ = 0.999 (sometimes 0.99), though these are configurable.

Bias correction ensures that the moment estimates are accurate in the early stages of training, preventing overly small updates at the beginning.
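Combining the two moment estimates with the bias correction gives the full Adam step. Here is a minimal sketch on a toy quadratic loss; the loss is an assumption, β₁, β₂, and ε are the commonly used defaults, and η is set a bit higher than the usual 0.001 so the toy example converges quickly:

def grad(w):
    # Gradient of the toy loss L(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

eta, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
w, m, v = 0.0, 0.0, 0.0

for t in range(1, 1001):                      # t starts at 1 for the bias correction
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g           # first moment (momentum-style average)
    v = beta2 * v + (1 - beta2) * g ** 2      # second moment (RMSProp-style average)
    m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
    w -= eta * m_hat / (v_hat ** 0.5 + eps)   # Adam parameter update

print(w)  # converges toward the minimum at w = 3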

 

Verdict

In most deep learning tasks, Adam works very well and is a strong default optimizer. However, there is no single optimizer that is best for all problems. In some cases, alternatives like RMSProp or SGD with momentum may perform better.

A practical approach is to start with Adam, evaluate performance, and then experiment with other optimizers if needed. In short, Adam generally provides reliable results, but empirical testing always matters.
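If you are working in PyTorch (an assumption; the post itself is framework-agnostic), all of the optimizers discussed above are available in torch.optim, so switching between them for such an experiment is a one-line change:

import torch

model = torch.nn.Linear(10, 1)   # placeholder model, purely for illustration

# The optimizers covered in this post, with illustrative learning rates:
opt_nag     = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)
opt_adagrad = torch.optim.Adagrad(model.parameters(), lr=0.01)
opt_rmsprop = torch.optim.RMSprop(model.parameters(), lr=0.001, alpha=0.9)
opt_adam    = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))

Each optimizer is then used the same way inside the training loop (zero_grad, backward, step), which keeps the empirical comparison straightforward.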

