
Layer Normalization Explained: Why Transformers Prefer It Over Batch Norm

  • Writer: Aryan
  • Mar 6
  • 8 min read

WHAT IS NORMALIZATION?

 

Normalization in deep learning refers to the process of transforming data or intermediate model outputs so that they follow specific statistical properties, most commonly a mean of 0 and a variance of 1. The goal is not to change the underlying information, but to make learning numerically stable and efficient.
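As a concrete sketch, standardizing one feature column takes only a few lines of NumPy (the values are illustrative):

```python
import numpy as np

# A single feature column on an arbitrary scale (illustrative values).
x = np.array([2.0, 1.0, 5.0, 6.0, 7.0])

# Standardization: subtract the mean, divide by the standard deviation.
x_norm = (x - x.mean()) / x.std()

print(x_norm.mean())  # ≈ 0
print(x_norm.std())   # ≈ 1
```

The information in the column is preserved: the ordering and relative spacing of the values do not change, only the scale does.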

 

WHAT DO WE NORMALIZE?

 

When features exist on different scales, they can be normalized using techniques such as standardization or min–max scaling, which bring values into a comparable range. This type of normalization is applied directly to the input data.

In neural networks, normalization is applied at two levels. First, the input features can be normalized before they enter the network. Second, the activations of hidden layers can be normalized during training.

For example, consider a dataset with three features f₁, f₂, and f₃, and a network consisting of an input layer, one or more hidden layers, and an output layer. Normalization can be applied to the input values (feature-level normalization) and also to the hidden-layer activations (activation-level normalization). Layer Normalization belongs to the second category, where we normalize internal activations rather than raw input data.

 

BENEFITS OF NORMALIZATION

 

  1. Improved training stability

    Normalization reduces the presence of extremely large or small values in activations, which helps prevent exploding or vanishing gradients and makes training more stable.

  2. Faster convergence

    When inputs or activations are normalized, gradients tend to have more consistent magnitudes. This leads to smoother and more reliable parameter updates during backpropagation, allowing the model to converge faster.

  3. Mitigation of internal covariate shift

    Internal covariate shift refers to changes in the distribution of layer inputs as model parameters update during training. Normalization techniques help reduce this shift, making optimization easier and more robust.

  4. Regularization effect

    Some normalization methods, such as batch normalization, introduce mild noise during training due to mini-batch statistics. This acts as a weak form of regularization and can help reduce overfitting.

 

INTERNAL COVARIATE SHIFT


Covariate Shift refers to a situation where the distribution of the input data changes between training, testing, or real-world prediction. Think of a model trained only on images of red roses. If you then ask it to classify roses of different colors during deployment, the input distribution has changed. This mismatch is the classic covariate shift problem.

Internal Covariate Shift is what happens inside a deep neural network during training. As the network's weights are updated via backpropagation, the distribution of activations output by one hidden layer and fed into the next constantly changes. This means each layer faces a shifting input distribution, making its learning task unstable and harder.

So, while standard covariate shift happens to the model's original input, internal covariate shift is the continuous change in the hidden layer inputs. Normalization techniques are designed to combat this internal instability.

If we want to normalize activations within a network, Batch Normalization is one common method. Layer Normalization, the focus of this article, is another.


BATCH NORMALIZATION

Consider a simple dataset with two features, f₁ and f₂, and many data points:

f₁   f₂
2    3
1    1
5    4
6    1
7    1
⋮    ⋮

 

We train an artificial neural network on this data. The network consists of an input layer, a hidden layer, and an output layer. Our objective is to apply Batch Normalization to the activations produced by the hidden-layer neurons.

Assume a batch size of 5. This means the first five rows are forwarded through the network together. Let us go through the process step by step.

For the first input row (2, 3), assume the weights between the input layer and the hidden layer are already initialized. For the first hidden neuron, let the weights be w₁ and w₂, with bias b₁. The pre-activation value is

z₁ = 2·w₁ + 3·w₂ + b₁ = 7 (assumed).

For the second hidden neuron, with weights w₃, w₄ and bias b₂,

z₂ = 2·w₃ + 3·w₄ + b₂ = 5 (assumed).

For the third hidden neuron, with weights w₅, w₆ and bias b₃,

z₃ = 2·w₅ + 3·w₆ + b₃ = 4 (assumed).
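These three pre-activations are just a matrix product. A minimal NumPy sketch, with weights and biases assumed purely so that the outputs match the 7, 5, and 4 used above:

```python
import numpy as np

x = np.array([2.0, 3.0])  # first input row (f₁, f₂)

# Assumed weights (2 inputs × 3 hidden neurons) and biases, chosen
# only so that the pre-activations come out to 7, 5, and 4.
W = np.array([[1.0, 0.5, 0.2],
              [1.5, 1.2, 1.0]])
b = np.array([0.5, 0.4, 0.6])

z = x @ W + b  # pre-activations (z₁, z₂, z₃)
print(z)       # [7. 5. 4.]
```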

Now we pass the second row (1, 1) through the same network. Using the same weights, assume the pre-activation values are

z₁ = 2, z₂ = 3, z₃ = 4.

We repeat this process for all remaining rows in the batch. After processing all five rows, the pre-activation values look like this:

f₁   f₂   z₁   z₂   z₃
2    3    7    5    4
1    1    2    3    4
5    4    1    2    3
6    1    7    5    6
7    1    3    3    4

At this stage, Batch Normalization begins. The normalization is applied feature-wise across the batch, meaning z₁, z₂, and z₃ are normalized independently.

For z₁, we compute the batch mean μ₁ and standard deviation σ₁ using all z₁ values in the batch. Similarly, we compute (μ₂, σ₂) for z₂ and (μ₃, σ₃) for z₃.

The normalization step is

ẑ₁ = (7 − μ₁) / σ₁.

For the z₁ column (7, 2, 1, 7, 3), this gives μ₁ = 4 and σ₁ ≈ 2.53, so ẑ₁ = (7 − 4) / 2.53 ≈ 1.19.

Each hidden neuron has two learnable parameters: a scale parameter γ and a shift parameter β. For the first neuron, these are γ₁ and β₁. Initially, γ₁ = 1 and β₁ = 0.

The final normalized output becomes

γ₁·ẑ₁ + β₁ = 1·1.19 + 0 ≈ 1.19.

The original value 7 is replaced with this normalized value. The same procedure is applied to all values of z₁, and then repeated independently for z₂ and z₃ using their respective μ, σ, γ, and β parameters.

After Batch Normalization, the transformed activations are passed through the activation function, producing the final outputs for all rows in the batch.
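The full batch-norm step can be sketched in NumPy using the z table above. Here ε is the usual small constant added inside the square root for numerical stability; everything else follows the procedure just described:

```python
import numpy as np

# Pre-activations for the batch of 5 rows (columns: z₁, z₂, z₃).
Z = np.array([[7, 5, 4],
              [2, 3, 4],
              [1, 2, 3],
              [7, 5, 6],
              [3, 3, 4]], dtype=float)

gamma = np.ones(3)   # scale parameters, initialized to 1
beta = np.zeros(3)   # shift parameters, initialized to 0
eps = 1e-5           # numerical stability constant

# Batch norm: statistics per column (per neuron), across the batch.
mu = Z.mean(axis=0)
sigma = np.sqrt(Z.var(axis=0) + eps)
out = gamma * (Z - mu) / sigma + beta

print(out[0, 0])  # ≈ 1.19, the normalized value that replaces the 7
```

After this step, every column of `out` has (approximately) zero mean and unit variance across the batch.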

 

WHY DON’T WE USE BATCH NORMALIZATION IN TRANSFORMERS?

 

The main reason is that batch normalization does not work well with self-attention, and more generally, it is not well suited for sequential data. In Transformer architectures, the normalization step is applied after the attention mechanism. To understand why batch normalization fails here, let us walk through a concrete example.

In self-attention, we start with a sentence, tokenize it, and convert each token into an embedding vector. Assume each word embedding has a fixed dimension. These embeddings are then passed through the self-attention mechanism, which produces contextual embeddings. Each contextual embedding has the same dimensionality as the input embedding.

So far, we have considered sending one sentence at a time. However, in practice, we train models using batches of sentences. Suppose we are building a sentiment analysis model using self-attention, and we have the following data:

Review               Sentiment
Hi Rahul             1
How are you today    0
I am good            0
You ?                1
 Now assume we send two sentences at once, i.e., a batch size of 2. Let the embedding dimension be 3. We compute the embedding vectors for the first two sentences (values are assumed for illustration).

Here, a problem appears. The first sentence has two words, while the second sentence has four words. Self-attention requires all inputs within a batch to have the same shape. Therefore, both sentences must have the same number of tokens.

To solve this, we use padding. We pad the shorter sentence with two padding tokens, each represented by a zero vector [0,0,0]. After padding, both sentences have four tokens, and the padding embeddings are entirely zero.

Now the inputs are ready to be passed into the self-attention module.

We send both sentences through self-attention and compute contextual embeddings for each token. These embeddings can be represented in matrix form. Each sentence produces a matrix of shape 4×3. Since we are processing two sentences together, the input tensor has shape (2,4,3), and the output tensor also has shape (2,4,3).

After self-attention, we obtain contextual embeddings for both sentences. The padding positions remain zero vectors, assuming the usual padding mask is applied so that padding tokens neither attend to real tokens nor contribute to them.

At this stage, since the values of contextual embeddings lie in different ranges, we may consider applying batch normalization. To do this, we stack the matrices vertically.

Each column now represents one embedding dimension. Batch normalization normalizes column-wise across the batch. For each column, we compute the mean μ and standard deviation σ, and apply normalization using learnable parameters γ and β.

However, this is where the problem arises.

In real-world scenarios, batch sizes are much larger. For example, with a batch size of 32, the longest sentence might have 100 words, while others may have only 30 words. To match input sizes, we must pad all sentences to length 100. As a result, most rows in the batch contain padding vectors, which are zero.

When we compute the mean and standard deviation column-wise, these padding zeros heavily influence the statistics. The resulting mean and variance no longer represent the true distribution of the actual token embeddings. Consequently, normalization becomes inaccurate and harmful.
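A small simulation makes this skew concrete. Below, 30 "real" embedding rows centered around 5 are stacked with 70 zero padding rows; the counts follow the example above, and the embedding distribution is assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# 30 real token embeddings (dimension 3) and 70 all-zero padding rows,
# stacked the way batch norm would see them: shape (100, 3).
real = rng.normal(loc=5.0, scale=1.0, size=(30, 3))
padding = np.zeros((70, 3))
stacked = np.vstack([real, padding])

print(real.mean(axis=0))     # ≈ 5 per dimension: the true token statistics
print(stacked.mean(axis=0))  # ≈ 1.5 per dimension: dragged down by padding
```

The column means batch norm would actually use (≈ 1.5) bear little resemblance to the statistics of the real tokens (≈ 5).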

Because batch normalization mixes information across samples in a batch and is highly sensitive to padding, it fails to work reliably for self-attention and Transformer architectures. This issue is common to sequential data in general.

For this reason, batch normalization is not used in Transformers, and alternative normalization techniques are preferred.

 

HOW DOES LAYER NORMALIZATION WORK?

 

We use the same setup as in batch normalization: the same neural network and the same data. Suppose we have pre-activation values from a hidden layer denoted as z₁, z₂, z₃. Each feature (or neuron) has its own learnable parameters γ and β.

The key difference lies in how normalization is applied.

In batch normalization, normalization is performed across the batch. In contrast, layer normalization normalizes across features.

This means that when computing the mean and standard deviation in layer normalization, we do so row-wise, across all features of a single data point. As a result, each row has its own mean and standard deviation, and this computation is performed independently for every row.

For the first row, we compute the mean μ₁ and standard deviation σ₁ across z₁, z₂, and z₃.

To normalize the first value z₁ = 7, we write

(7 − μ₁) / σ₁

For the first row (7, 5, 4), μ₁ ≈ 5.33 and σ₁ ≈ 1.25, so this gives ≈ 1.34. We then apply the learnable scale and shift parameters:

1.34 · γ₁ + β₁

For the first-row value of z₂, normalization is performed using the same μ₁ and σ₁, but with its own parameters:

((5 − μ₁) / σ₁) · γ₂ + β₂

Similarly, for the first-row value of z₃:

((4 − μ₁) / σ₁) · γ₃ + β₃

Now consider the second row. We compute a new mean and standard deviation for that row, say μ₂ and σ₂. For example, normalization of the second-row value of z₁ becomes:

((2 − μ₂) / σ₂) · γ₁ + β₁

This process is repeated for all features and for every row independently.

In summary, the main difference between batch normalization and layer normalization is straightforward: batch normalization normalizes across the batch, whereas layer normalization normalizes across features.
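A minimal NumPy sketch over the same z table makes the contrast explicit: compared with the batch-norm sketch, the only essential change is the axis along which the statistics are computed.

```python
import numpy as np

# The same pre-activation table (columns: z₁, z₂, z₃).
Z = np.array([[7, 5, 4],
              [2, 3, 4],
              [1, 2, 3],
              [7, 5, 6],
              [3, 3, 4]], dtype=float)

gamma = np.ones(3)   # one scale parameter per feature
beta = np.zeros(3)   # one shift parameter per feature
eps = 1e-5           # numerical stability constant

# Layer norm: statistics per row, across the features.
mu = Z.mean(axis=1, keepdims=True)
sigma = np.sqrt(Z.var(axis=1, keepdims=True) + eps)
out = gamma * (Z - mu) / sigma + beta

print(out[0, 0])  # ≈ 1.34, the first row's 7 normalized with that row's μ and σ
```

Every row of `out` now has (approximately) zero mean and unit variance across its features.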

 

LAYER NORMALIZATION IN TRANSFORMERS

In this setup, we send two sentences into the self-attention mechanism, and for both sentences we obtain contextual embeddings. For each word embedding, we compute a separate mean μ and standard deviation σ. At the same time, each embedding dimension has its own learnable parameters γ and β (here shown for three features).

In layer normalization, each embedding dimension has its own γ and β, but the normalization itself is applied across features. This means we compute the mean and standard deviation row-wise, that is, independently for each word embedding.

First, we compute the mean μ₁ and standard deviation σ₁ for the first word across all embedding dimensions. To normalize the first dimension value (for example, 6.5), we write

((6.5 − μ₁) / σ₁) · γ₁ + β₁

If we want to normalize the second dimension value (2.41) of the same word, we use the same μ₁ and σ₁, but apply its own parameters:

((2.41 − μ₁) / σ₁) · γ₂ + β₂

Similarly, for the third embedding dimension of that word:

((3.21 − μ₁) / σ₁) · γ₃ + β₃

This process is repeated for every word in both sentences. Each word has its own mean and standard deviation, computed independently across its embedding dimensions.

Importantly, padding tokens do not affect normalization. Because layer normalization is applied row-wise, a padding row's mean and standard deviation are computed only from its own zeros, so it simply normalizes to zero and never enters the statistics of real words. The mean and variance used for each real token therefore reflect only that token's actual values.
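This row-wise independence is easy to verify. In the sketch below (the second token's values are assumed), the zero padding rows normalize to zero while the real rows use only their own statistics:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each row across its features (embedding dimensions).
    mu = x.mean(axis=-1, keepdims=True)
    sigma = np.sqrt(x.var(axis=-1, keepdims=True) + eps)
    return (x - mu) / sigma

# One padded sentence: two real token embeddings plus two padding rows.
sentence = np.array([[6.5, 2.41, 3.21],   # real token (values from the example)
                     [1.2, 4.0, 0.7],     # real token (assumed values)
                     [0.0, 0.0, 0.0],     # padding
                     [0.0, 0.0, 0.0]])    # padding

out = layer_norm(sentence)
print(out[2], out[3])  # both stay [0. 0. 0.]: padding normalizes to zero
```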

This is the key advantage of layer normalization in Transformers: it works correctly with variable-length sequences and padding, making it ideal for self-attention–based models.
