
Optimizers in Deep Learning: Role of Gradient Descent, Types, and Key Challenges

  • Writer: Aryan
  • Dec 20, 2025
  • 3 min read

ROLE OF OPTIMIZER

 

In a neural network, we work with weights and biases, and the number of these parameters varies depending on the architecture. Our goal is to find the best possible values for these parameters so that the model performs well during predictions. In simple terms, training a neural network is an optimization problem where we try to minimize the error and make our model’s predictions as close as possible to the real data.

At the beginning, we initialize the weights and biases with random values. These initial values usually produce a high loss. The loss function measures how far our predictions are from the actual outputs, and it is directly dependent on our model parameters. Our objective is to update these parameters step-by-step so that the loss gradually decreases and eventually reaches a point where it becomes minimal. Ideally, we want to reach the global minimum of the loss function.

To achieve this, we use an optimizer. In deep learning, one of the most commonly used optimizers is Gradient Descent. It updates the model parameters using the following rule:

w ← w − η · ∂L/∂w

Here,

  • w represents the weights,

  • L is the loss function, and

  • η (eta) is the learning rate, which controls the step size.

We repeat this update many times across multiple epochs, and with each iteration, we move closer to better values of the parameters. Eventually, the network converges toward a minimum of the loss function (ideally the global minimum), giving us improved prediction performance.
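
To make the update rule concrete, here is a minimal NumPy sketch (not from the original post) that trains a one-feature linear model with a mean-squared-error loss using exactly this rule; the data, variable names, and hyperparameters are illustrative choices.

```python
import numpy as np

# Toy data: y = 2x + 1 plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100)
y = 2 * X + 1 + 0.1 * rng.normal(size=100)

# Random initialization of the weight and bias
w, b = rng.normal(), rng.normal()
eta = 0.1  # learning rate

for epoch in range(200):
    y_pred = w * X + b              # forward pass
    error = y_pred - y
    loss = np.mean(error ** 2)      # mean-squared-error loss L

    # Gradients of L with respect to w and b
    dL_dw = np.mean(2 * error * X)
    dL_db = np.mean(2 * error)

    # Gradient Descent update: parameter <- parameter - eta * gradient
    w -= eta * dL_dw
    b -= eta * dL_db

print(f"learned w = {w:.3f}, b = {b:.3f}")  # should end up close to 2 and 1
```

Each pass over the data lowers the loss, and the parameters drift toward the values that generated the data (roughly 2 and 1 here).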

 

 

TYPES OF OPTIMIZERS

 

In deep learning, Gradient Descent is generally applied in three different ways: Batch Gradient Descent, Stochastic Gradient Descent, and Mini-Batch Gradient Descent. Below is a brief overview of each:

 

Batch Gradient Descent

This method uses the entire training dataset to compute the gradient of the loss function and update the weights. It provides a stable and accurate direction toward the minimum but can be computationally expensive for large datasets.

 

Stochastic Gradient Descent (SGD)

Instead of using the full dataset, SGD updates the parameters using just one training sample at a time. Updates happen very frequently, which leads to faster iterations and helps escape local minima, but the training path becomes noisier and less stable.

 

Mini-Batch Gradient Descent

This is a practical and widely used approach where gradients are calculated using a small subset (mini-batch) of the training data. It balances the stability of batch gradient descent and the speed of SGD, making it the most commonly used version in modern deep learning.
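
Since the three variants differ only in how many samples feed each gradient estimate, they can be sketched with a single hypothetical train function whose batch_size argument selects the behaviour: the whole dataset for Batch Gradient Descent, 1 for SGD, and a small number such as 32 for Mini-Batch Gradient Descent. The helper name and numbers below are illustrative, not from any library.

```python
import numpy as np

def train(X, y, batch_size, eta=0.1, epochs=50, seed=0):
    """Train y ~ w*x + b with gradient descent on batches of the given size."""
    rng = np.random.default_rng(seed)
    w, b = 0.0, 0.0
    n = len(X)
    for _ in range(epochs):
        idx = rng.permutation(n)                    # shuffle once per epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            err = (w * X[batch] + b) - y[batch]
            w -= eta * np.mean(2 * err * X[batch])  # gradient from this batch only
            b -= eta * np.mean(2 * err)
    return w, b

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, 512)
y = 3 * X - 0.5 + 0.1 * rng.normal(size=512)

print(train(X, y, batch_size=len(X)))  # Batch Gradient Descent: one update per epoch
print(train(X, y, batch_size=1))       # Stochastic Gradient Descent: one sample per update
print(train(X, y, batch_size=32))      # Mini-Batch Gradient Descent: the usual compromise
```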

 

CHALLENGES

 

Why do we need other optimizers even though we already have these three variants of Gradient Descent? The reason is that, in practice, Gradient Descent faces several challenges:

  1. Learning Rate Sensitivity

The update rule includes a learning rate, and choosing it well is crucial. If the learning rate is too small, convergence becomes very slow. If it is too large, the updates become unstable, causing the model to overshoot and possibly never reach the minimum. Ideally, larger steps help when we are far from the minimum and smaller steps help when we are close, but a single fixed learning rate cannot handle both situations efficiently.
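
A tiny numerical experiment illustrates this sensitivity. Minimizing the one-dimensional loss L(w) = w² (gradient 2w) with a fixed learning rate, a value that is too small barely moves, a moderate one converges, and a value that is too large overshoots and diverges; the specific numbers below are arbitrary illustrations, not recommendations.

```python
def gradient_descent(eta, w0=5.0, steps=20):
    """Minimize L(w) = w**2 (gradient 2*w) with a fixed learning rate."""
    w = w0
    for _ in range(steps):
        w -= eta * 2 * w
    return w

print(gradient_descent(eta=0.01))  # ~3.34 -> too small: barely moved after 20 steps
print(gradient_descent(eta=0.1))   # ~0.06 -> reasonable: close to the minimum at 0
print(gradient_descent(eta=1.1))   # ~192  -> too large: every step overshoots and diverges
```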

  2. Learning Rate Scheduling Limitations

Learning rate scheduling tries to change the learning rate automatically according to a predefined schedule. However, these schedules and thresholds must be defined manually, and different datasets may require different schedules. A schedule that works for one dataset might not work efficiently for another, which makes it difficult to apply universally.
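
For example, a simple step-decay schedule might halve the learning rate every ten epochs, as sketched below. The drop factor and interval are hand-picked thresholds, which is precisely the limitation: values that suit one dataset may be too fast or too slow for another.

```python
def step_decay(initial_eta, epoch, drop=0.5, epochs_per_drop=10):
    """Halve the learning rate every `epochs_per_drop` epochs (a manual schedule)."""
    return initial_eta * (drop ** (epoch // epochs_per_drop))

for epoch in (0, 10, 20, 30):
    print(epoch, step_decay(initial_eta=0.1, epoch=epoch))
# 0 -> 0.1, 10 -> 0.05, 20 -> 0.025, 30 -> 0.0125
# drop and epochs_per_drop are hand-tuned thresholds chosen before training begins.
```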

  3. Multiple Parameters and Directions

Gradient Descent uses one learning rate for all parameters, but different weights may require different update speeds. The same learning rate in every direction means the model moves with equal speed in all parameter dimensions, which may not be ideal. If some parameters need faster updates and others slower ones, standard gradient descent cannot assign separate learning rates for different directions.
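
To see the problem, consider a loss that is much steeper along one parameter than another, such as L(w1, w2) = 50·w1² + 0.5·w2² (an arbitrary example). In the sketch below, any single learning rate small enough to keep the steep w1 direction stable leaves the flat w2 direction crawling.

```python
import numpy as np

def gd_2d(eta, steps=100):
    """Gradient descent on L(w1, w2) = 50*w1**2 + 0.5*w2**2 (gradient: 100*w1, w2)."""
    w = np.array([1.0, 1.0])
    for _ in range(steps):
        grad = np.array([100 * w[0], w[1]])
        w = w - eta * grad
    return w

print(gd_2d(eta=0.019))  # w1 ~ 0, but w2 is still ~0.15 after 100 steps (too slow)
print(gd_2d(eta=0.021))  # w1 explodes: the steep direction is now unstable
```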

  4. Local Minima Problem

The loss landscape of deep learning models may contain multiple minima. Our goal is to reach the global minimum, but sometimes the optimizer can get stuck in a local minimum and struggle to escape. Stochastic Gradient Descent may help due to its noisy updates, but Batch and Mini-Batch Gradient Descent have a higher chance of getting stuck, leading to sub-optimal solutions.
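
The effect can be reproduced on a one-dimensional loss with two minima, for instance L(w) = w⁴ − 2w² + 0.3w (an arbitrary illustrative function): started on the wrong side of the hump, plain gradient descent settles into the shallower local minimum and never reaches the deeper global one.

```python
def loss(w):
    return w**4 - 2 * w**2 + 0.3 * w

def descend(w, eta=0.01, steps=2000):
    """Gradient descent on loss(w); the gradient is 4*w**3 - 4*w + 0.3."""
    for _ in range(steps):
        w -= eta * (4 * w**3 - 4 * w + 0.3)
    return w

w_stuck = descend(1.5)    # ends near w ~  0.96: the shallow local minimum
w_best = descend(-0.5)    # ends near w ~ -1.04: the deeper global minimum
print(w_stuck, loss(w_stuck))   # loss ~ -0.71
print(w_best, loss(w_best))     # loss ~ -1.31
```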

  5. Saddle Points and Flat Regions

A saddle point is a point where the loss surface curves upward in one direction and downward in another, and the surrounding region often forms a nearly flat plateau. At saddle points, the gradient becomes very small or zero, so there is no useful update and training progress can stall. This makes saddle points and flat regions another major challenge for basic gradient descent approaches.
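
A classic toy example is L(x, y) = x² − y², which has a saddle at the origin. Sitting exactly on the line y = 0, the gradient along y is zero, so plain gradient descent slides into the saddle point and stalls there even though the loss could still decrease along y; a tiny perturbation off the axis (similar in spirit to SGD's noisy updates) is enough to escape. The sketch below is illustrative only.

```python
import numpy as np

def gd_saddle(start, eta=0.1, steps=200):
    """Gradient descent on L(x, y) = x**2 - y**2 (gradient: 2x, -2y)."""
    p = np.array(start, dtype=float)
    for _ in range(steps):
        grad = np.array([2 * p[0], -2 * p[1]])
        p -= eta * grad
    return p

print(gd_saddle([1.0, 0.0]))   # -> [~0, 0]: the update vanishes at the saddle and training stalls
print(gd_saddle([1.0, 1e-3]))  # -> y keeps growing: a tiny nudge off the axis escapes the saddle
```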

