Multi-Head Attention in Transformers Explained: Concepts, Math & Mechanics
- Aryan

- Mar 2
THE PROBLEM WITH SELF-ATTENTION
To understand the limitation of self-attention, let us examine it through a simple example. Consider the sentence:
“The man saw the astronomer with a telescope.”
This sentence is ambiguous and can be interpreted in two different ways. In the first interpretation, the man is using the telescope to see the astronomer. In the second interpretation, the astronomer is the one who has the telescope, and the man sees the astronomer who is holding it. Because of this ambiguity, the sentence carries two valid meanings.
The issue is that standard self-attention tends to capture only one dominant interpretation at a time. It cannot simultaneously represent both meanings. Depending on similarity scores, self-attention may strongly associate man with telescope, leading to the interpretation where the man is using the telescope. Alternatively, it may assign higher similarity between astronomer and telescope, producing the interpretation where the astronomer is holding the telescope. However, once a particular attention pattern dominates, the alternative perspective is suppressed.
The core problem is that self-attention operates from a single representational perspective. When multiple perspectives exist within the same context, self-attention struggles to model them simultaneously. It effectively collapses ambiguity into one interpretation.
In natural language processing, there are many scenarios where capturing multiple perspectives is essential. For example, in a document summarization task, a document may contain several important viewpoints or themes. If we rely only on single-head self-attention, the generated summary may reflect only one dominant perspective. To produce richer and more balanced summaries, we need a mechanism that can attend to the same information from different perspectives and generate multiple interpretations.
This limitation of self-attention motivates the need for multi-head attention, which allows the model to look at the same sequence through multiple attention subspaces simultaneously.
HOW DOES MULTI-HEAD ATTENTION WORK?

Multi-head attention provides a simple yet powerful solution to the limitation of single-head self-attention. Consider again a sentence like “The man saw the astronomer with a telescope”, which contains multiple valid perspectives. Using a single self-attention mechanism, we can usually extract only one dominant interpretation. The idea behind multi-head attention is to use multiple self-attention mechanisms in parallel so that different perspectives can be captured simultaneously.
In standard self-attention, we start with word embeddings and apply three weight matrices to generate the query (Q), key (K), and value (V) vectors. Using these vectors, we compute attention scores and finally produce one contextual embedding per word.
In multi-head self-attention, instead of using a single set of weight matrices, we use multiple independent sets. Each set corresponds to one attention head. If we use two heads, we effectively run self-attention twice, each time with its own learned projection matrices. There is no strict restriction on the number of heads; in principle, we could use three, four, or more self-attention heads.
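To make the idea of independent weight sets concrete, here is a minimal NumPy sketch (the dimensions and head count are illustrative, not from any particular model). Each head owns its own query, key, and value matrices, and each projects the same input embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads = 4, 2
X = rng.normal(size=(2, d_model))   # embeddings for a 2-word sentence

# One independent (Wq, Wk, Wv) triple per head -- nothing is shared.
weights = [[rng.normal(size=(d_model, d_model)) for _ in range(3)]
           for _ in range(n_heads)]

# Each head projects the SAME input with its OWN matrices.
per_head_qkv = [[X @ W for W in triple] for triple in weights]
print(len(per_head_qkv), [m.shape for m in per_head_qkv[0]])
# 2 [(2, 4), (2, 4), (2, 4)]
```

Adding a third or fourth head would simply mean appending more weight triples to the list; nothing else in the computation changes.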
Because each head has its own weight matrices, the query, key, and value vectors for every word are generated multiple times, once per head. The subsequent attention computation remains the same, but it is performed independently for each head. As a result, each head produces its own contextual embedding for every word. A concrete walkthrough makes this easier to follow.

For example, assume we are using two attention heads. For the word “money”, we first generate its embedding. This embedding is then projected using two different sets of weight matrices, producing two separate sets of Q, K, and V vectors. The same process is repeated for the word “bank”. Since each head has its own set of vectors, the attention computation is also performed twice, once per head.

If we want to generate the contextual embedding for money, we obtain two different contextual vectors, one from each attention head. Each vector reflects a different attention pattern and, potentially, a different semantic perspective. This is the fundamental difference introduced by multi-head attention.

A multi-head attention block therefore consists of multiple self-attention blocks running in parallel. The same input embeddings (for example, money and bank) are fed into each self-attention head, and each head produces its own contextual embeddings. When the sentence is longer, every attention head captures different relationships, dependencies, and perspectives within the same sentence.
By using multiple heads, the attention mechanism is applied multiple times with different learned projections, allowing the model to capture diverse interpretations and relational patterns. In the original Transformer paper, eight attention heads are used, enabling the model to attend to information from several perspectives at the same time.
HOW IS MULTI-HEAD ATTENTION APPLIED?

Consider a sentence with two words: money and bank. We first generate embeddings for these two words. Assume each embedding has 4 dimensions, so the input embedding matrix has shape 2 × 4. Now, suppose we use two attention heads, meaning we apply two independent self-attention modules in parallel.
For each attention head, we have three weight matrices corresponding to query, key, and value, and each of these matrices has shape 4 × 4. We multiply the embedding matrix by these weight matrices separately for each head. As a result, for each head we obtain Q, K, and V matrices of shape 2 × 4, one row per word.
Once we have the query, key, and value vectors, the usual self-attention computation is applied independently in each head. From the first attention head, we obtain an output matrix z₁ of shape 2 × 4, which contains the contextual embeddings y_money¹ and y_bank¹. Similarly, from the second attention head, we obtain z₂ of shape 2 × 4, containing y_money² and y_bank².
At this stage, we have two contextual representations for each word, one from each attention head. These outputs are then concatenated. Concatenating z₁ and z₂ along the feature dimension gives a new matrix z′ with shape 2 × 8. This matrix represents the combined output of all attention heads.
However, the model expects the final output to have the same dimensionality as the original embeddings, which is 2 × 4. To achieve this, we apply a linear transformation using a weight matrix of shape 8 × 4. When we multiply z′ (2 × 8) with this matrix, we obtain the final output z (2 × 4). This projection matrix is also learned during training.
The resulting matrix z contains the final contextual embeddings for both money and bank. This same procedure naturally extends to longer sentences and to a larger number of attention heads.
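The full walkthrough can be sketched end to end in NumPy. The weights are random stand-ins for learned parameters, and, matching the example above (unlike the original Transformer), each head keeps the full 4 dimensions:

```python
import numpy as np

def softmax(x):
    """Row-wise softmax with the usual max-subtraction for stability."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_model, n_heads = 4, 2
X = rng.normal(size=(2, d_model))            # embeddings for "money", "bank"

zs = []
for _ in range(n_heads):
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv         # each 2 x 4
    z = softmax(Q @ K.T / np.sqrt(d_model)) @ V
    zs.append(z)                             # z1, z2: each 2 x 4

z_prime = np.concatenate(zs, axis=-1)        # concatenated heads: 2 x 8
Wo = rng.normal(size=(n_heads * d_model, d_model))  # learned 8 x 4 projection
z_final = z_prime @ Wo                       # back to 2 x 4
print(z_prime.shape, z_final.shape)          # (2, 8) (2, 4)
```

The final matrix has the same shape as the input embeddings, so the block can be stacked or fed into the rest of the network unchanged.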
MULTI-HEAD ATTENTION IN THE ORIGINAL TRANSFORMER

In the original “Attention Is All You Need” paper, the same idea is applied at a larger scale. Instead of 4-dimensional embeddings, each word embedding has 512 dimensions. For a sentence with two words, the embedding matrix therefore has shape 2 × 512.
The model uses eight attention heads. Each head has its own set of weight matrices for query, key, and value. Instead of projecting directly to 512 dimensions, each head projects the embeddings down to 64 dimensions. This means the projection matrices have shape 512 × 64, and after projection, the Q, K, and V matrices for each head have shape 2 × 64.
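Using NumPy purely to check the shapes (the two-word sentence is the running example, not a requirement of the architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, d_head = 512, 8, 64   # values from the original paper
X = rng.normal(size=(2, d_model))       # two 512-dimensional word embeddings

# Each head projects 512 -> 64, so its Q, K, V matrices are 2 x 64.
Wq = rng.normal(size=(d_model, d_head))
Q = X @ Wq
print(Q.shape)            # (2, 64)

# Eight heads concatenated restore the model width: 8 * 64 = 512.
print(n_heads * d_head)   # 512
```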
Self-attention is then applied independently in each head, producing outputs
y_money¹, y_bank¹ through y_money⁸, y_bank⁸, each of shape 2 × 64. These eight outputs are concatenated, resulting in a matrix of shape 2 × 512.
Finally, a linear projection using the output weight matrix Wₒ of shape 512 × 512 is applied. This produces the final output matrix of shape 2 × 512, where the first row corresponds to money and the second to bank. Importantly, the input and output dimensions remain the same.
Reducing the dimensionality from 512 to 64 inside each attention head significantly lowers computational cost. Instead of performing a single expensive self-attention operation at 512 dimensions, the model performs multiple smaller attention operations in parallel. The overall computation remains comparable, while allowing the model to capture multiple perspectives simultaneously.
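A quick parameter count makes the "comparable cost" claim concrete: eight narrow heads use exactly as many projection parameters as one full-width head would.

```python
d_model, n_heads, d_head = 512, 8, 64

# Per-head Q/K/V projections: three 512 x 64 matrices per head.
per_head = 3 * d_model * d_head
all_heads = n_heads * per_head            # eight heads in parallel

# A single full-width head would use three 512 x 512 matrices.
single_full = 3 * d_model * d_model

print(all_heads, single_full)             # 786432 786432
```

The same balance holds for the attention scores themselves: one head scoring at width 512 costs about as much as eight heads each scoring at width 64.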
In this setup, concatenation can be seen as joining different perspectives, while the final linear projection learns how much importance to assign to each perspective. The final representation is therefore a learned mixture of multiple attention-based views of the same sentence. This is how multi-head attention is applied in the original Transformer architecture.


