
Why Weight Initialization Is Important in Deep Learning (Xavier vs He Explained)

  • Writer: Aryan
  • 7 min read

Why Weight Initialization Is Important?

 

The primary goal during neural network training is to find the optimal values for weights to minimize the loss. Weight initialization is the critical first step where we set the initial values of these weights before training begins.

If we do not initialize weights correctly, the model may fail to learn effectively, leading to poor convergence. Improper initialization can trigger several significant issues, such as:

  1. Vanishing Gradient: The gradients become too small, effectively stopping the network from learning.

  2. Exploding Gradient: The gradients become too large, causing unstable updates.

  3. Slow Convergence: The model takes significantly longer to reach an optimal solution.

Note: The vanishing gradient problem is primarily caused by two factors: the use of certain activation functions (like Sigmoid) and incorrect weight initialization.

 

Wrong Ways of Initialization


Case 1: Zero Initialization

 

We should never initialize all weights to zero. To understand why, let's look at a regression example using a neural network with two inputs (x₁ , x₂), two hidden neurons, and one output neuron.

The Problem with ReLU Activation

Assume the hidden layer uses the ReLU activation function.

Let a₁₁ and a₁₂ be the activation values of the hidden neurons, where:

a₁₁ = max(0, z₁₁)  and  z₁₁ = w₁₁⁽¹⁾x₁ + w₂₁⁽¹⁾x₂ + b₁₁

a₁₂ = max(0, z₁₂)   and   z₁₂ = w₁₂⁽¹⁾x₁ + w₂₂⁽¹⁾x₂ + b₁₂

If we initialize weights and biases to zero, then z₁₁ and z₁₂ become zero. Consequently, the activations a₁₁ and a₁₂ also become zero.

When we apply backpropagation to update the weights using the rule:

w_new = w_old − η · ∂L/∂w

Since the hidden activations are zero (so the gradients of the output-layer weights vanish) and the ReLU derivative at z = 0 is taken as zero, the derivative ∂L/∂w is zero for every weight in the network.

Result: There is no update. The weights remain stuck at zero, and no training takes place.
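
As a quick numerical check, here is a minimal NumPy sketch (a toy example with assumed input values, not the article's own code) of one forward and backward pass through this 2-2-1 network with zero initialization, ReLU hidden units, and a squared-error loss; every gradient comes out exactly zero, so nothing can update.

```python
import numpy as np

# Toy 2 -> 2 (ReLU) -> 1 network, all weights and biases set to zero.
x = np.array([1.5, -0.7])          # two input features x1, x2 (assumed values)
y = 1.0                            # target value (assumed)
W1, b1 = np.zeros((2, 2)), np.zeros(2)
W2, b2 = np.zeros(2), 0.0

# Forward pass
z1 = x @ W1 + b1                   # [0, 0]
a1 = np.maximum(0, z1)             # ReLU -> [0, 0]
y_hat = a1 @ W2 + b2               # 0

# Backward pass (squared-error loss)
dy_hat = y_hat - y
dW2 = dy_hat * a1                  # [0, 0]  (activations are zero)
dz1 = dy_hat * W2 * (z1 > 0)       # [0, 0]  (weights are zero, ReLU'(0) taken as 0)
dW1 = np.outer(x, dz1)             # all zeros -> no update anywhere

print(dW1, dW2)                    # every gradient is exactly zero
```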

 

The Symmetry Problem (Sigmoid & Tanh)

If we use Tanh or Sigmoid activation functions, the problem changes slightly but remains critical.

  • Tanh Case: The formula is a = (eᶻ − e⁻ᶻ)/(eᶻ + e⁻ᶻ). If z = 0, then a = 0. While the derivative of Tanh at 0 is 1 (not zero), we still face the symmetry problem described below.

  • Sigmoid Case: The formula is a = σ(z) = 1/(1 + e⁻ᶻ). If z = 0, then a = 0.5.

Since all weights are zero, every neuron in the hidden layer receives the same input and computes the exact same output (0.5 for Sigmoid).

Mathematical Proof of Symmetry:

Let's calculate the gradients for the weights. Using the Chain Rule:

∂L/∂w₁₁⁽¹⁾ = (∂L/∂ŷ) · (∂ŷ/∂a₁₁) · (∂a₁₁/∂z₁₁) · (∂z₁₁/∂w₁₁⁽¹⁾)

∂L/∂w₁₂⁽¹⁾ = (∂L/∂ŷ) · (∂ŷ/∂a₁₂) · (∂a₁₂/∂z₁₂) · (∂z₁₂/∂w₁₂⁽¹⁾)

Because the weights are identical (zero):

  1. The error terms are identical.

  2. The activation derivatives are identical (because a₁₁ = a₁₂ and z₁₁ = z₁₂).

  3. The inputs are the same (x₁ for both w₁₁ and w₁₂).

Substituting the partial derivatives:

∂L/∂w₁₁⁽¹⁾ = (∂L/∂ŷ) · (∂ŷ/∂a₁₁) · (∂a₁₁/∂z₁₁) · x₁

∂L/∂w₁₂⁽¹⁾ = (∂L/∂ŷ) · (∂ŷ/∂a₁₂) · (∂a₁₂/∂z₁₂) · x₁

The derivatives for w₁₁ and w₁₂ are identical. Similarly, w₂₁ and w₂₂ are identical.

After the update, the weights change by the exact same amount and remain equal. The hidden units behave as if they are a single neuron, merely duplicated N times.

No matter how many nodes you add to the hidden layer, if they are initialized to zero, they fail to capture non-linearity. The network effectively acts as a simple Linear Model (Perceptron), losing the power of Deep Learning.
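
Here is a minimal sketch (toy input values and plain SGD with a squared-error loss, all assumed, not the article's experiments) that runs a few updates from an all-zero start with Sigmoid hidden units; the two columns of W1, one per hidden neuron, stay exactly equal at every step.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy 2 -> 2 (sigmoid) -> 1 network, all parameters initialized to zero.
x, y, lr = np.array([0.4, -1.2]), 1.0, 0.5   # assumed toy values
W1, b1 = np.zeros((2, 2)), np.zeros(2)
W2, b2 = np.zeros(2), 0.0

for step in range(5):
    # forward pass: both hidden activations are identical (0.5 at the start)
    a1 = sigmoid(x @ W1 + b1)
    y_hat = a1 @ W2 + b2
    # backward pass (squared-error loss)
    dy_hat = y_hat - y
    dW2, db2 = dy_hat * a1, dy_hat
    dz1 = dy_hat * W2 * a1 * (1 - a1)        # identical for both hidden neurons
    dW1, db1 = np.outer(x, dz1), dz1
    # plain SGD update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
    print(step, W1[:, 0], W1[:, 1])          # the two columns remain exactly equal
```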

 

Case 2: Non-Zero Constant Initialization

 

What happens if we initialize all weights and biases to a constant non-zero value, such as 0.5?

At first glance, this might seem better than zero initialization because the neurons are not "dead" (the output is non-zero). However, this approach suffers from the exact same symmetry issue we saw with Sigmoid/Zero-Initialization. This failure applies to all activation functions, including Tanh, ReLU, and Sigmoid.


The Mathematical Breakdown

Let’s assume all weights are initialized to 0.5.

We calculate the pre-activation values (z) for two hidden neurons:

  1. z₁₁ = w₁₁⁽¹⁾x₁ + w₂₁⁽¹⁾x₂ + b₁₁

  2. z₁₂ = w₁₂⁽¹⁾x₁ + w₂₂⁽¹⁾x₂ + b₁₂

Since all weights are identical (w₁₁⁽¹⁾ = w₁₂⁽¹⁾, etc.) and they receive the same inputs (x₁, x₂), the pre-activations are identical:

z₁₁ = z₁₂

Consequently, their activation outputs are also identical:

a₁₁ = max(0, z₁₁) = max(0, z₁₂) = a₁₂

 

The Gradient Problem

Because the activations are identical, the backpropagation updates will also be identical. Let's look at the partial derivatives:

∂L/∂w₁₁⁽¹⁾ = (∂L/∂ŷ) · (∂ŷ/∂a₁₁) · (∂a₁₁/∂z₁₁) · x₁

∂L/∂w₁₂⁽¹⁾ = (∂L/∂ŷ) · (∂ŷ/∂a₁₂) · (∂a₁₂/∂z₁₂) · x₁

Since a₁₁ = a₁₂ and z₁₁ = z₁₂, these two gradients are exactly the same. The same logic applies to the second set of weights:

∂L/∂w₂₁⁽¹⁾ = (∂L/∂ŷ) · (∂ŷ/∂a₁₁) · (∂a₁₁/∂z₁₁) · x₂

∂L/∂w₂₂⁽¹⁾ = (∂L/∂ŷ) · (∂ŷ/∂a₁₂) · (∂a₁₂/∂z₁₂) · x₂

Even though the values are non-zero, the weights update by the exact same amount in every iteration. The symmetry is never broken.

  • The hidden units act as a single neuron duplicated multiple times.

  • The model fails to capture the complexity of the data and effectively behaves like a Linear Model.

We cannot initialize weights to zero, nor can we initialize them to a constant non-zero value. Both methods prevent the neural network from learning non-linear patterns.
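
The same kind of toy simulation as before (values are assumed, not taken from the article), now with ReLU and every parameter set to 0.5, shows that the updates are non-zero but the two hidden neurons never diverge from each other.

```python
import numpy as np

# Toy 2 -> 2 (ReLU) -> 1 network, every weight and bias set to the constant 0.5.
x, y, lr = np.array([0.4, -1.2]), 1.0, 0.1   # assumed toy values
W1, b1 = np.full((2, 2), 0.5), np.full(2, 0.5)
W2, b2 = np.full(2, 0.5), 0.5

for step in range(5):
    z1 = x @ W1 + b1
    a1 = np.maximum(0, z1)                   # non-zero, but identical for both neurons
    y_hat = a1 @ W2 + b2
    dy_hat = y_hat - y                       # squared-error loss gradient
    dz1 = dy_hat * W2 * (z1 > 0)             # identical for both hidden neurons
    W2 -= lr * dy_hat * a1; b2 -= lr * dy_hat
    W1 -= lr * np.outer(x, dz1); b1 -= lr * dz1
    print(step, W1[:, 0], W1[:, 1])          # columns stay equal: symmetry is never broken
```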

 

Case 3: Random Initialization (Small Values)


Let's consider a scenario where we initialize weights with very small random values, for example, in the range of -0.01 to 0.01.

Experimental Setup:

  • Data: 1000 rows, 500 input features (normalized, roughly -1 to 1).

  • Architecture: 3 hidden layers with 500 nodes each.

  • Initialization: Weights are generated using np.random.randn multiplied by 0.01. Biases are set to zero.

  • Activation: Tanh.


The Problem:

Since the weights w are very small (e.g., 0.01), the weighted sum z = ∑ wᵢ xᵢ + b will also be a very small number.

When we pass this small z through the Tanh function, the output activation is also very small (close to zero).

As these small activations pass through multiple layers, the values tend to collapse further towards zero. By the final layer, the activations are effectively zero.
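
The following NumPy sketch loosely reproduces this setup (the article does not give its exact code, so the details here are assumed): 1000 samples with 500 features in [-1, 1], three Tanh layers of 500 units, weights drawn as randn × 0.01, biases zero. The standard deviation of the activations shrinks layer after layer.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 500))         # 1000 rows, 500 normalized features

a = X
for layer in range(3):
    W = rng.standard_normal((500, 500)) * 0.01   # tiny random weights, biases are zero
    a = np.tanh(a @ W)
    print(f"layer {layer + 1}: std of activations = {a.std():.5f}")

# The printed standard deviation drops with every layer, so by the last
# layer the activations are crowded tightly around zero.
```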

Impact on Training:

During backpropagation, we calculate the gradients to update the weights.

  1. Vanishing Gradient: Since the weights and activations are tiny, the gradient signal diminishes rapidly as it flows backward through the layers.

  2. Slow Convergence: With tiny gradients, weight updates are insignificant. The model learns extremely slowly, if at all.

Summary for Other Activations:

  • Sigmoid: suffers from the same vanishing gradient issue in deep networks.

  • ReLU: While less severe than Tanh/Sigmoid, initializing with tiny values creates very weak signals, leading to extremely slow training.

 

Case 4: Random Initialization (Large Values)


Now, let's look at the opposite extreme: initializing weights with large random values, for example, in the range of -3 to 3.

The Problem with Saturation (Sigmoid/Tanh):

If weights are large, the weighted sum z = ∑ wᵢ xᵢ becomes a large number (either highly positive or highly negative).

  • Tanh: The output will be forced to the extremes: -1 or 1.

  • Sigmoid: The output will be forced to the extremes: 0 or 1.

This state is called Saturation. In these saturated regions, the slope (derivative) of the activation function is nearly zero.

  • Consequence: When the derivative is zero, the gradient becomes zero. No information flows backward during backpropagation. This leads to the Vanishing Gradient Problem purely due to saturation.

The Problem with Unstable Gradients (ReLU):

ReLU is a non-saturating function (f(x) = max(0, x)), so it does not vanish due to saturation. However, it faces a different problem.

  • Large weights lead to large z values, which lead to large activations.

  • This results in massive gradients during backpropagation.

  • Consequence: The weight updates become too large (Exploding Gradients). The model takes massive "jumps" during optimization, failing to converge to the optimal solution and becoming unstable.
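
A quick simulation (my own, with assumed shapes matching the earlier setup) illustrates both failure modes: with weights drawn uniformly from [-3, 3], the large majority of Tanh units saturate and their derivatives collapse, while ReLU activations grow explosively from layer to layer.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 500))         # same 500-feature input as before
W = rng.uniform(-3, 3, size=(500, 500))          # large random weights

# Tanh: most units are pushed into the flat, saturated region.
a_tanh = np.tanh(X @ W)
saturated = np.mean(np.abs(a_tanh) > 0.99)
print(f"tanh units with |a| > 0.99: {saturated:.1%}")            # the large majority
print(f"mean tanh derivative (1 - a^2): {(1 - a_tanh**2).mean():.3f}")  # far below tanh'(0) = 1

# ReLU: no saturation, but activations (and hence gradients) blow up.
a_relu = X
for layer in range(3):
    W = rng.uniform(-3, 3, size=(500, 500))
    a_relu = np.maximum(0, a_relu @ W)
    print(f"ReLU layer {layer + 1}: mean |activation| = {np.abs(a_relu).mean():.2e}")
```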


We have now established that simple initialization strategies are insufficient for deep learning:

  1. Zero Initialization: Fails (Symmetry problem, no learning).

  2. Constant Non-Zero: Fails (Symmetry problem).

  3. Small Random Values: Fails (Vanishing gradients, activations collapse).

  4. Large Random Values: Fails (Saturation in Tanh/Sigmoid, Exploding gradients in ReLU).

We clearly need a more intelligent way to initialize weights.


Techniques for Weight Initialization

 

So far, we have discussed strategies that fail (Zero, Constant, Random Small/Large). Now, we will look at the techniques that actually work. These methods are heuristics—practical solutions proven effective through experimentation to find the optimal range for initial weights.

Recap: What Not to Do

  1. Zero Initialization: No learning due to symmetry.

  2. Non-Zero Constant: Symmetry persists; acts as a linear model.

  3. Small Random Weights: Vanishing Gradient problem.

  4. Large Random Weights: Exploding Gradient (ReLU) or Saturation (Tanh/Sigmoid).

 

The Intuition: Controlling Variance

 

The core goal of initialization is to keep the variance of the layer outputs consistent with the variance of the inputs. We need to prevent the signals from exploding or vanishing as they pass through the network.

The Logic:

Our neurons calculate z = ∑ wᵢxᵢ. The scale of this sum depends heavily on the number of connections. To understand the formulas below, we define:

  • fan_in: The number of input units coming into the layer (size of the previous layer).

  • fan_out: The number of output units generated by the layer (size of the current layer).

  • If a neuron has many inputs (large fan_in), the sum z tends to become very large if weights are not scaled down.

  • To balance this, if fan_in is high, the weights w must be smaller.

Researchers found that the variance of the weights should be inversely proportional to the number of inputs:

Var(W) = 1/n, where n is the number of inputs (fan_in).

This ensures that regardless of how many inputs a node receives, the scale of the output signal remains stable.
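
A tiny sketch of this variance argument (the sample count and fan_in are arbitrary choices): with unit-variance inputs, unscaled weights make Var(z) grow with n, while weights with Var(W) = 1/n keep the output variance close to the input variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500                                           # fan_in (assumed value)
x = rng.standard_normal((10000, n))               # inputs with variance ~1

w_unscaled = rng.standard_normal(n)               # Var(W) = 1
w_scaled = rng.standard_normal(n) / np.sqrt(n)    # Var(W) = 1/n

print("Var(z), unscaled:", (x @ w_unscaled).var())   # roughly n (about 500)
print("Var(z), scaled:  ", (x @ w_scaled).var())     # roughly 1
```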


1. Xavier / Glorot Initialization

 

  • Best Used For: Tanh, Sigmoid, or Softmax activation functions.

  • Goal: Keep the variance of the activations in the forward pass and of the gradients in the backward pass roughly the same across layers, which is why the formula balances both fan_in and fan_out.

A. Xavier Normal

Weights are drawn from a normal distribution centered at 0 with a standard deviation (σ) calculated as:

σ = √(2 / (fanᵢₙ + fanₒᵤₜ))

(Note: Some implementations simplify this to σ = √(1/fanᵢₙ), but the formula above is the standard Glorot definition.)

B. Xavier Uniform

Weights are drawn from a uniform distribution within the range [-limit, limit], where:

limit = √(6 / (fanᵢₙ + fanₒᵤₜ))
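
A plain-NumPy sketch of both Glorot variants (the helper names and shapes are illustrative, not from any library):

```python
import numpy as np

rng = np.random.default_rng(0)

def glorot_normal(fan_in, fan_out):
    sigma = np.sqrt(2.0 / (fan_in + fan_out))     # Xavier Normal standard deviation
    return rng.normal(0.0, sigma, size=(fan_in, fan_out))

def glorot_uniform(fan_in, fan_out):
    limit = np.sqrt(6.0 / (fan_in + fan_out))     # Xavier Uniform range limit
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W = glorot_uniform(500, 500)
print(W.std())    # empirically close to sqrt(2 / (500 + 500)) ≈ 0.045
```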

2. He Initialization

 

  • Best Used For: ReLU (and its variants like Leaky ReLU).

  • Goal: ReLU zeroes out roughly half of the activations, so the weights need roughly twice the variance (2/fanᵢₙ instead of 1/fanᵢₙ) to maintain the signal strength.

A. He Normal

Weights are drawn from a normal distribution with standard deviation (σ):

σ = √(2 / fanᵢₙ)

B. He Uniform

Weights are drawn from a uniform distribution within the range [-limit, limit], where:

limit = √(6 / fanᵢₙ)
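
The analogous NumPy sketch for the He variants (again, helper names are illustrative; only fan_in enters the formulas):

```python
import numpy as np

rng = np.random.default_rng(0)

def he_normal(fan_in, fan_out):
    sigma = np.sqrt(2.0 / fan_in)                 # He Normal standard deviation
    return rng.normal(0.0, sigma, size=(fan_in, fan_out))

def he_uniform(fan_in, fan_out):
    limit = np.sqrt(6.0 / fan_in)                 # He Uniform range limit
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))
```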
  • Glorot Uniform is often the default initialization in deep learning libraries like Keras and TensorFlow.

  • Rule of Thumb:

    • Using ReLU? → Use He Initialization.

    • Using Tanh/Sigmoid? → Use Xavier (Glorot) Initialization.
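
As a usage sketch (layer sizes are placeholders), this rule of thumb maps onto Keras's kernel_initializer argument per layer:

```python
import tensorflow as tf

# Match the initializer to the activation of each layer.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu",
                          kernel_initializer="he_normal"),      # ReLU -> He
    tf.keras.layers.Dense(64, activation="tanh",
                          kernel_initializer="glorot_normal"),  # Tanh -> Xavier
    tf.keras.layers.Dense(1, activation="sigmoid"),             # Keras default is glorot_uniform
])
```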
