Why Is Self-Attention Called “Self”? Understanding Attention Mechanisms from Encoder–Decoder to Transformers
- Aryan

- Feb 28
- 5 min read
HOW IS SELF-ATTENTION AN “ATTENTION” MECHANISM?
To understand why self-attention is called self-attention, we must first understand why it is considered an attention mechanism at all. If we compare self-attention with earlier attention mechanisms such as Bahdanau attention and Luong attention, we realize that self-attention is structurally very different. So the correct way to approach this is: first understand why self-attention qualifies as an attention mechanism, and then understand why the word self is used.
To do this, let us briefly recap how attention originated.
Attention was introduced to solve sequence-to-sequence problems such as language translation using neural networks. In machine translation, we are given a sentence in English and we want to generate its translation in another language, for example Hindi. Traditionally, this problem was solved using an encoder–decoder architecture.
In this setup, both the encoder and the decoder are usually built using LSTM cells. The encoder processes the input sentence word by word. At each time step, it updates its hidden state, and after processing the full sentence, the final hidden state is produced. This final hidden state is called the context vector, and it is intended to be a compressed summary of the entire input sentence. This context vector is then passed to the decoder, which generates the output sentence step by step.
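The encoder loop described above can be sketched in NumPy, with a plain RNN cell standing in for the LSTM as a simplification; the dimensions and weights below are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, chosen only for illustration
embed_dim, hidden_dim, seq_len = 4, 3, 5

# A plain RNN cell standing in for the LSTM cell described above
W_xh = rng.normal(size=(hidden_dim, embed_dim))
W_hh = rng.normal(size=(hidden_dim, hidden_dim))

def encode(embeddings):
    """Process the sentence word by word, updating the hidden state."""
    h = np.zeros(hidden_dim)
    hidden_states = []
    for x in embeddings:                  # one word embedding per time step
        h = np.tanh(W_xh @ x + W_hh @ h)  # update the hidden state
        hidden_states.append(h)
    return hidden_states, h               # the final h is the context vector

sentence = rng.normal(size=(seq_len, embed_dim))  # fake word embeddings
hidden_states, context_vector = encode(sentence)
# context_vector has shape (hidden_dim,): one fixed-size summary of the sentence
```

Note that however long the sentence gets, `context_vector` stays the same size, which is exactly the bottleneck discussed next.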
This approach works reasonably well for short sentences, but it has a major limitation. We are placing the entire responsibility of summarizing the sentence on a single context vector, which is just one fixed-size vector of numbers. When the input sentence becomes long, it becomes very difficult to compress all of its information into this one vector. Researchers observed that when sentence length exceeds roughly 30 words, translation quality degrades significantly.
To address this limitation, the attention mechanism was introduced. The key idea of attention is simple: instead of sending a single context vector to the decoder, we send multiple context vectors, one for each decoder time step. Each context vector focuses on different parts of the input sentence depending on what the decoder is trying to generate at that moment.
In the decoder, when we generate the first output word, it has its own context vector. This context vector contains information about which input words are most relevant for producing the current output word. These context vectors are computed using the hidden states produced by the encoder.
More concretely, let the encoder hidden states be h₁, h₂, h₃, h₄. For the first decoder time step, the context vector c₁ is computed as a weighted sum of these hidden states:
c₁ = α₁₁h₁ + α₁₂h₂ + α₁₃h₃ + α₁₄h₄
Similarly, for the second decoder time step:
c₂ = α₂₁h₁ + α₂₂h₂ + α₂₃h₃ + α₂₄h₄
The α values are called attention weights. Each αᵢⱼ tells us how important the j-th encoder hidden state is when generating the i-th output word. If a particular α value is large, it means that the corresponding encoder hidden state is more useful for producing the current output.
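These weighted sums are straightforward to compute; here is a minimal NumPy sketch in which the hidden states and attention weights are made-up toy numbers:

```python
import numpy as np

# Four encoder hidden states h1..h4 (toy 3-dimensional values, made up)
H = np.array([[0.1, 0.3, 0.5],
              [0.2, 0.1, 0.4],
              [0.6, 0.2, 0.1],
              [0.3, 0.5, 0.2]])

# Attention weights for the first decoder time step (they sum to 1)
alpha_1 = np.array([0.7, 0.1, 0.1, 0.1])

# c1 = a11*h1 + a12*h2 + a13*h3 + a14*h4, as a single matrix product
c1 = alpha_1 @ H
```

Here the large weight on h₁ means the first input word dominates the context vector for this decoding step.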
Now the important question is: how are these α values computed?
For each αᵢⱼ, there is a corresponding score eᵢⱼ called the alignment score. For a fixed decoder step i, the attention weights are obtained by applying softmax across that step's alignment scores:
αᵢⱼ = exp(eᵢⱼ) / Σₖ exp(eᵢₖ)
In Luong attention, the alignment score eᵢⱼ is computed using a dot product between the decoder hidden state sᵢ and the encoder hidden state hⱼ:
eᵢⱼ = sᵢᵀhⱼ
For example, e₁₁ is computed using s₁ᵀh₁, e₁₂ using s₁ᵀh₂, and so on. After computing all eᵢⱼ values, we apply softmax across them to obtain the α values, which are then used to compute the context vector cᵢ as a weighted sum of encoder hidden states.
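The full Luong pipeline for one decoder step (dot-product scores, softmax, weighted sum) can be sketched as follows; the vectors here are random toy values rather than trained hidden states:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy encoder hidden states h1..h4 and a decoder state s1; the dot-product
# score requires them to share a dimension (3 here, chosen arbitrarily)
H = rng.normal(size=(4, 3))   # rows are h1..h4
s1 = rng.normal(size=3)

def softmax(x):
    e = np.exp(x - x.max())   # subtract the max for numerical stability
    return e / e.sum()

e1 = H @ s1           # alignment scores e_1j = s1^T h_j
alpha1 = softmax(e1)  # attention weights for decoder step 1
c1 = alpha1 @ H       # context vector: weighted sum of encoder states
```

Repeating this with s₂, s₃, … produces c₂, c₃, …, one fresh context vector per decoder step.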
This mechanism is called attention because the decoder explicitly decides which parts of the input sequence to focus on while generating each output word. This completes the description of Luong attention.
Now, with this understanding of attention, we are ready to compare it with self-attention and see why self-attention is still an attention mechanism—and why it is called self.
HOW IS SELF-ATTENTION SIMILAR TO BAHDANAU AND LUONG ATTENTION?
In self-attention, we do not have a separate encoder and decoder. Instead, we only have a single sentence. From this sentence, we first generate word embeddings, and then from these word embeddings we generate contextual embeddings. To understand this clearly, it helps to compare the process side by side with encoder–decoder attention.
In self-attention, for each word embedding, we generate three new vectors: query (q), key (k), and value (v), by multiplying the embedding with three learned weight matrices. These are computed for every word in the sentence. After that, for each word, we generate its contextual embedding by attending to all words in the same sentence, including the word itself.
Consider the sentence: “turn off the lights”. Suppose we want to generate the contextual embedding for the word turn. To do this, we take the query vector q_turn and compare it with the key vectors of all words in the sentence: k_turn, k_off, k_the, k_lights.
The contextual embedding for turn can be written as:
y_turn = w₁₁v_turn + w₁₂v_off + w₁₃v_the + w₁₄v_lights
Now the key question is: how do we compute these weights wᵢⱼ?
First, we compute similarity scores by taking the dot product between the query and each key:
s₁₁ = q_turnᵀk_turn
s₁₂ = q_turnᵀk_off
s₁₃ = q_turnᵀk_the
s₁₄ = q_turnᵀk_lights
Then we apply softmax across these four scores to obtain the weights:
w₁ⱼ = exp(s₁ⱼ) / Σₖ exp(s₁ₖ)
These weights are then used to compute the weighted sum of value vectors, giving us the contextual embedding y_turn. This plays the same role as the context vector c₁ in encoder–decoder attention.
Next, when we want to compute the contextual embedding for the word off, we repeat the same process. We take q_off and compute its similarity with all key vectors in the sentence, apply softmax to obtain w₂₁, w₂₂, w₂₃, w₂₄, and then compute:
y_off = w₂₁v_turn + w₂₂v_off + w₂₃v_the + w₂₄v_lights
This is effectively c₂.
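The whole worked example can be sketched in a few lines of NumPy; the embedding dimension and the projection matrices W_q, W_k, W_v below are made-up stand-ins for parameters a real model would learn:

```python
import numpy as np

rng = np.random.default_rng(2)

words = ["turn", "off", "the", "lights"]
d = 4                                    # toy embedding dimension (made up)
X = rng.normal(size=(len(words), d))     # word embeddings, one row per word

# Projection matrices (random here; learned during training in a real model)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v      # q, k, v for every word

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

S = Q @ K.T        # similarity scores s_ij = q_i^T k_j for every word pair
W = softmax(S)     # row-wise softmax gives the weights w_ij

# Contextual embeddings: each y_i is a weighted sum of all value vectors
Y = W @ V

y_turn, y_off = Y[0], Y[1]   # the y_turn and y_off of the worked example
```

Notice that, unlike the decoder loop in Luong attention, all rows of Y are computed at once; nothing here depends on a previous output step.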
If we compare this with encoder–decoder attention, we can see a strong similarity. In Luong attention, the decoder hidden states s₁, s₂, s₃, s₄ act as queries. They ask which encoder hidden state is most useful at the current decoding step. The encoder hidden states h₁, h₂, h₃, h₄ act as keys, and at the same time they also act as values used to form the context vector.
In self-attention, the same idea is applied, but instead of decoder hidden states, the queries come from the same sequence. The word representations themselves generate queries, keys, and values. The similarity computation, softmax normalization, and weighted sum are conceptually the same as in Luong attention.
This is why self-attention is still called an attention mechanism. It explicitly computes alignment scores, converts them into attention weights, and uses those weights to focus on relevant parts of a sequence.
WHY DO WE CALL IT “SELF” ATTENTION?
In Luong attention, alignment is computed between two different sequences—for example, an English input sentence and a Hindi output sentence. This is an inter-sequence attention mechanism.
In self-attention, the attention computation happens within the same sequence. Queries, keys, and values all come from the same sentence. In other words, the model computes attention between different positions of the same sequence. This is an intra-sequence attention mechanism.
Because the alignment is computed within the sequence itself, and not between two different sequences, it is called self-attention.