Scaled Dot-Product Attention Explained: Why We Divide by √dₖ in Transformers
- Aryan

- Feb 21
In self-attention, we first construct the Query (Q), Key (K), and Value (V) matrices by applying learned linear transformations to the input word embeddings. Each word embedding is projected into these three different vector spaces to serve different roles in the attention mechanism.
To compute self-attention, we take the dot product between the Query matrix and the transpose of the Key matrix. This operation measures how strongly each word should attend to every other word in the sequence. The resulting scores are then passed through a softmax function to convert them into normalized attention weights. Finally, these weights are multiplied with the Value matrix to produce the contextualized embeddings, where each word representation now incorporates information from relevant words in the sequence.
Mathematically, this process can be written as:
Attention(Q,K,V) = softmax(QKᵀ)⋅V
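As a sketch, this unscaled computation can be written in a few lines of NumPy. The token count, dimensions, and the random Q, K, V matrices below are illustrative, not learned values:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_unscaled(Q, K, V):
    scores = Q @ K.T                    # how strongly each query attends to each key
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted sum of value vectors

# Toy example: 4 tokens, 3-dimensional Q/K/V vectors.
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 3))
K = rng.standard_normal((4, 3))
V = rng.standard_normal((4, 3))

out = attention_unscaled(Q, K, V)
print(out.shape)  # (4, 3): one contextualized vector per token
```

Each row of the softmax output is a probability distribution over the sequence, which is what makes the result a weighted mixture of value vectors.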
In the original research paper Attention Is All You Need, the core idea remains the same, but there is an important modification. The dot-product scores are divided by the square root of the key dimension dₖ:

Attention(Q,K,V) = softmax(QKᵀ/√dₖ)⋅V
Here, the term √dₖ is used for scaling. Without this scaling factor, the dot-product values can become very large when the dimensionality of the vectors increases, which pushes the softmax function into regions with extremely small gradients. Scaling stabilizes training and leads to better optimization behavior.
Because of this scaling step, the mechanism is referred to as scaled dot-product attention.
What is dₖ?
dₖ denotes the dimensionality of the key vector. For every word in the sequence, we generate a key vector, and the dimensionality of all key vectors remains the same across the entire sequence.
To understand where this value comes from, start with word embeddings. Each word embedding has a fixed dimension, such as 64, 256, or 512 in practical models. For simplicity, assume a toy example where each word embedding is 3-dimensional, so the embedding shape is (1,3).
We then apply linear transformations using the weight matrices W₍Q₎, W₍K₎, W₍V₎. If these weight matrices have dimensions (3,3), multiplying a word embedding (1,3) with any of these matrices produces an output of shape (1,3). As a result, the Query, Key, and Value vectors all have the same dimensionality.
In this case:
d₍q₎ = d₍k₎ = d₍v₎ = 3
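This shape bookkeeping is easy to check with a few lines of NumPy; the weight matrices here are random placeholders standing in for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(42)

x = rng.standard_normal((1, 3))   # one word embedding, shape (1, 3)

# Projection matrices of shape (3, 3); random placeholders for learned weights.
W_Q = rng.standard_normal((3, 3))
W_K = rng.standard_normal((3, 3))
W_V = rng.standard_normal((3, 3))

q, k, v = x @ W_Q, x @ W_K, x @ W_V
print(q.shape, k.shape, v.shape)  # (1, 3) (1, 3) (1, 3), so d_q = d_k = d_v = 3
```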
Substituting this into the attention formula, we get:

Attention(Q,K,V) = softmax(QKᵀ/√3)⋅V
This means that when we compute the dot product QKᵀ, we scale the result by dividing it by √3 before applying the softmax function, and then multiply the resulting attention weights with the Value matrix.
Why do we divide by √3?
We scale by 1/√dₖ (that is, divide by √dₖ) because of the statistical behavior of the dot product, especially when vector dimensionality increases.
In the attention formula, we compute QKᵀ, where both Q and K are matrices. Conceptually, this operation performs many vector–vector dot products, one per query–key pair. In our toy example, each query and key vector is 3-dimensional; if the sequence contains 3 words, QKᵀ is a 3 × 3 matrix holding 9 scalar scores.
A key property of the dot product is that its variance grows with vector dimensionality. When vectors are low-dimensional, the resulting dot-product values tend to have relatively small variance. As the dimensionality increases, the dot-product values can become much larger and more spread out, leading to high variance. For example, dot products of 512-dimensional vectors will typically have much higher variance than dot products of 3-dimensional vectors.
This high variance becomes a problem when we apply the softmax function. Softmax is sensitive to large input values: large scores get mapped to very high probabilities, while smaller scores get pushed close to zero. When the variance of QKᵀ is high, softmax produces extremely peaked distributions. During backpropagation, this causes the model to focus almost entirely on a few large values, while gradients corresponding to smaller values become negligible, leading to vanishing gradients and unstable training.
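A quick numeric illustration of this sensitivity, using arbitrary score values at two different scales:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # shift by max for numerical stability
    return e / e.sum()

# The same relative pattern of scores, at a small and a large scale.
small = np.array([1.0, 2.0, 3.0])
large = small * 20   # what high-variance dot-product scores look like

print(softmax(small))  # fairly balanced: roughly [0.09, 0.24, 0.67]
print(softmax(large))  # almost all probability mass on the largest score
```

At the larger scale the distribution is essentially one-hot, so gradients with respect to the smaller scores are nearly zero.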
Dividing QKᵀ by √dₖ reduces the variance of the dot-product scores. With lower variance, the softmax outputs become more balanced, probabilities are comparable, and gradients flow more evenly during training. This stabilizes optimization without forcing us to use low-dimensional embeddings, which would limit the model’s expressive power.
Therefore, scaling by 1/√dₖ is a simple and effective way to control variance, stabilize softmax behavior, and ensure reliable training in high-dimensional attention models.
How do we reduce the variance of the matrix obtained after the dot product QKᵀ?
When the dimensionality d of the vectors increases, the variance of the numbers produced by the dot product also increases. This high variance is undesirable and must be controlled. A standard way to reduce variance is scaling: when values are divided by a suitable factor, their variance automatically decreases. The key question is what scaling factor should be used.
To understand this, look at the problem from a variance perspective. Each element in the matrix QKᵀ is the result of a dot product between two vectors. As the dimensionality of these vectors increases, the variance of these dot-product values increases proportionally.
Consider a simple case first. Assume one-dimensional vectors: let v₁ = [a] be dotted with v₄ = [b], v₅ = [c], and v₆ = [d], producing the values s₁₁ = ab, s₁₂ = ac, s₁₃ = ad. These values can be viewed as samples from a random variable X. What we care about here is not the sample variance but the population (expected) variance, since future vectors may produce different values. The expected variance of this row is therefore Var(X).
Now increase the dimensionality to two. Suppose the vectors become [a,b] and are dotted with [c,d], [e,f], and [g,h]. The resulting values are ac+bd, ae+bf, and ag+bh. Let these be values of a new random variable Y. In this case, the variance increases because an additional dimension contributes to the dot product. Roughly, we observe:
Var(Y) ≈ 2 Var(X)
If we further increase the dimensionality to three, the dot products involve three summed terms, and the variance increases again. Let these values correspond to another random variable Z. Then:
Var(Z) ≈ 3 Var(X)
In general, for vectors of dimension d, the variance of the dot product grows approximately linearly with d:

Var(dot product) ≈ d⋅Var(X)
This shows that dimension and variance have a linear relationship. However, for stable training, we want the variance of the dot-product values to remain roughly constant, independent of the dimensionality.
To achieve this, we use a basic statistical property of variance. If a random variable X has variance Var(X), and we scale it by a constant c to form Y = cX, then:
Var(Y) = c² Var(X)
We can use this property to control variance. If the variance after the dot product is approximately d⋅Var(X), dividing the values by √d scales the variance as:

Var(QKᵀ/√d) = (1/√d)² ⋅ d⋅Var(X) = Var(X)
This keeps the variance constant across different dimensions. In the context of self-attention, d = dₖ, the key dimension. Therefore, we divide the dot-product matrix QKᵀ by √dₖ.
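Both claims — that the variance of the dot product grows like d, and that dividing by √d restores it — can be checked empirically with random vectors whose components have unit variance (so Var(X) = 1):

```python
import numpy as np

rng = np.random.default_rng(0)

for d in (3, 64, 512):
    # 100,000 independent dot products of d-dimensional N(0, 1) vectors.
    q = rng.standard_normal((100_000, d))
    k = rng.standard_normal((100_000, d))
    dots = (q * k).sum(axis=1)
    print(d, dots.var(), (dots / np.sqrt(d)).var())
    # raw variance grows roughly like d; scaled variance stays near 1
```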
As a result, the attention mechanism includes an additional scaling step: after computing QKᵀ, we scale it by 1/√dₖ, then apply softmax, and finally multiply by the value matrix to obtain the contextual embeddings. The final formula becomes:

Attention(Q,K,V) = softmax(QKᵀ/√dₖ)⋅V
This scaling ensures controlled variance, stable softmax behavior, and effective training of the self-attention model.
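Putting the pieces together, here is a minimal NumPy sketch of the full mechanism; the token count and dimensions are illustrative:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V — a minimal single-head sketch."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # scale before the softmax
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(7)
Q, K, V = (rng.standard_normal((5, 64)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (5, 64)
```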


