Attention Mechanism Explained: Why Seq2Seq Models Need Dynamic Context
- Aryan

- Feb 12
- 8 min read
THE WHY?

Now let us discuss why we need the attention mechanism.
Consider the encoder–decoder architecture (in the figure, the pink block is the encoder, the green block is the decoder, and the yellow cells are LSTM units). We feed an input sentence to the encoder, which processes it sequentially, one time step at a time. After reading the entire sequence, the encoder produces a summary of the sentence as a single vector: a set of numbers meant to represent the overall meaning of the input. This vector is passed to the decoder, which then generates the output sentence one word at a time. That is the basic working of the encoder–decoder architecture.
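To make this data flow concrete, here is a minimal PyTorch sketch of such a vanilla encoder–decoder (not the exact model from the figure; the class name, layer sizes, and vocabulary sizes are all illustrative). Notice that the decoder only ever receives the single final encoder state as its summary of the input.

```python
import torch
import torch.nn as nn

class VanillaSeq2Seq(nn.Module):
    """Toy encoder-decoder: the whole source sentence is compressed
    into one fixed-size state (the final encoder hidden/cell pair)."""
    def __init__(self, src_vocab, tgt_vocab, emb_dim=64, hid_dim=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, tgt_vocab)

    def forward(self, src, tgt):
        # Encoder reads the full sentence; only (h, c) survives as the "summary".
        _, (h, c) = self.encoder(self.src_emb(src))
        # Decoder starts from that single static summary and unrolls step by step.
        dec_states, _ = self.decoder(self.tgt_emb(tgt), (h, c))
        return self.out(dec_states)              # scores over the target vocabulary

model = VanillaSeq2Seq(src_vocab=1000, tgt_vocab=1000)
logits = model(torch.randint(0, 1000, (1, 5)),   # 5 source tokens
               torch.randint(0, 1000, (1, 4)))   # 4 target tokens
print(logits.shape)                              # torch.Size([1, 4, 1000])
```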
Now, consider a long sentence such as:
“Once upon a time in a small Indian village, a mischievous monkey stole a turban from a sleeping barber, wore it to a wedding, danced with the bewildered guests, accidentally got crowned the ‘Banana King’ by the local kids, and ended up leading a vibrant, impromptu parade of laughing villagers, cows, and street dogs, all while balancing a stack of mangoes on its head, creating a hilariously unforgettable spectacle and an amusing legend that the village still chuckles about every monsoon season.”
This sentence is over 80 words long. If we read it and try to translate it into Hindi, we cannot do it perfectly in a single reading: we naturally struggle to summarize the entire sentence at once, and translating it in one go becomes difficult. The same limitation appears in the encoder–decoder architecture.
When we give a long sentence to the encoder and ask it to capture everything in one pass, the encoder is forced to compress the entire meaning into a single fixed-size context vector. If the sentence length increases beyond around 25 words, this context vector becomes overloaded. As a result, important information can be lost, leading to the same difficulty that humans face when trying to remember and process a very long sentence at once.
This also creates unnecessary pressure on the decoder. While predicting a single word at a specific decoder time step, we do not need the entire sentence representation. At any decoder time step, we only need information related to a few relevant input words. However, in the standard encoder–decoder architecture, the decoder receives the same static context vector at every time step. Because of this static representation, the decoder struggles to focus on the most relevant parts of the input sentence, which leads to decoding errors.
Ideally, instead of using the whole sentence representation every time, it would be better if the model could dynamically select specific words or parts of the input sentence depending on the current decoding step. If the model could focus on different input tokens at different time steps, based on what is needed to predict the next output word, the translation and sequence generation process would become much more effective. This need for a dynamic, time-dependent representation is the core motivation behind the attention mechanism.
THE SOLUTION
The idea behind the solution is inspired by how we, as human beings, process language. When we translate a long sentence or a large piece of text, we do not translate it by holding the entire sentence in our mind at once. Instead, as we move through the sentence, we translate it on the go.
While reading, our eyes naturally create a region of attention. At a given moment, we focus only on a small part of the sentence that is relevant, translate that part, and then shift our attention forward. This shifting focus continues until the entire sentence is translated. This is how humans handle long and complex sentences efficiently.
We want to introduce this same idea into our model architecture. Rather than forcing the model to rely on a single fixed summary of the entire input sentence, we pass additional information that tells the model which part of the input is most useful for producing a particular output word. In other words, for each output step, there exists a specific set of input words that should receive more focus.
This mechanism of dynamically assigning importance to different parts of the input sequence is called the attention mechanism.
THE WHAT

In the encoder, we start from an initial hidden state and pass the hidden state of each LSTM cell on to the next time step, producing h₀, h₁, h₂, h₃, h₄. All of these hidden states are vectors. The decoder likewise maintains its own internal hidden states, denoted s₀, s₁, s₂, s₃, s₄, and receives inputs y₀, y₁, y₂, y₃ sequentially. The key requirement is that, at every decoder time step, we must explicitly provide the decoder with information about which encoder time steps are useful for generating the current output. This is exactly what the attention mechanism adds.
In the standard encoder–decoder architecture, at a particular decoder time step, say time step 2, the decoder typically requires two inputs: the previous output y₁ and the previous decoder hidden state s₁. In other words, we provide two pieces of information at each decoder time step. However, with the attention mechanism, we introduce an additional piece of information that indicates which encoder time steps are important. As a result, instead of two inputs, we now provide three pieces of information at each decoder time step.
We denote this additional input as the context vector cᵢ, where c₁, c₂, c₃, c₄ correspond to different decoder time steps. This cᵢ is called the attention input. The major difference from the vanilla encoder–decoder model is that, at a particular decoder time step, the decoder now receives the triplet [yᵢ₋₁, sᵢ₋₁, cᵢ], instead of just [yᵢ₋₁, sᵢ₋₁]. The purpose of cᵢ is to inform the decoder about which encoder hidden states are most useful for producing the current output.
For example, if we consider the first decoder time step, where the model needs to output the word “light” (the first word of the Hindi translation), it may turn out that the fourth encoder hidden state is the most relevant. In that case, the role of c₁ is to provide the decoder with information derived mainly from that fourth encoder hidden state. Since cᵢ is constructed from encoder hidden states, it is itself a vector, and if each encoder hidden state hⱼ has a fixed dimensionality, cᵢ has the same dimensionality as hⱼ.
When more than one encoder hidden state is relevant, we combine them using a weighted sum. At a given decoder time step, the attention mechanism assigns a weight to each encoder hidden state. Let these weights be α₁, α₂, α₃, α₄. Then, for the first decoder time step, the context vector is computed as
c₁ = α₁h₁ + α₂h₂ + α₃h₃ + α₄h₄.
Here, α values are scalars, and hⱼ are vectors. If, for example, α₄ = 0.6 dominates, then h₄ contributes more strongly to c₁ than the other hidden states. This process is repeated for every decoder time step, such as c₂, c₃, and so on, with a new set of attention weights each time.
In general, for decoder time step i and encoder time step j, the context vector is given by
cᵢ = ∑ⱼ αᵢⱼ hⱼ.
For instance, to compute c₁, we calculate α₁₁, α₁₂, α₁₃, α₁₄ and then form
c₁ = α₁₁h₁ + α₁₂h₂ + α₁₃h₃ + α₁₄h₄.
Similarly, for c₂,
c₂ = α₂₁h₁ + α₂₂h₂ + α₂₃h₃ + α₂₄h₄.
If there are m decoder time steps and n encoder time steps, the total number of attention weights αᵢⱼ is m × n. For example, with four encoder time steps and four decoder time steps, we compute sixteen attention weights in total.
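As a quick numerical sketch of these weighted sums (the dimensionality and the weight values are placeholders), we can stack the four encoder hidden states into a matrix and compute all four context vectors at once as a matrix product of the attention weights with that matrix:

```python
import torch

torch.manual_seed(0)
d = 8                              # dimensionality of each encoder state h_j
H = torch.randn(4, d)              # rows are h1..h4

# One row of weights per decoder step: alpha[i-1, j-1] = alpha_ij,
# i.e. 4 x 4 = 16 weights in total. Softmax makes each row sum to 1,
# matching weights like 0.6 in the example above.
scores = torch.randn(4, 4)
alpha = torch.softmax(scores, dim=1)

C = alpha @ H                      # row i-1 is c_i = sum_j alpha_ij * h_j
print(alpha.sum(dim=1))            # tensor of ones
print(C.shape)                     # torch.Size([4, 8])
```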
Now let us see how the α (alpha) values are calculated.
α is an alignment score or similarity score. It tells us how important a particular encoder hidden state is while deciding the decoder output at a specific time step. In other words, α indicates which encoder hidden state plays a more important role for the current decoder prediction.
Let us calculate α₂₁. This value depends on two quantities: the encoder hidden state h₁ and the decoder’s previous hidden state s₁. Formally,
α₂₁ = f(h₁, s₁).
Here, s₁ is the previous hidden state of the decoder. It is important because, in the attention mechanism, the decision at the current decoding step depends on what has already been generated in the previous step. The context of earlier outputs is carried through s₁. Thus, α₂₁ is a function of both h₁ and s₁.
In general, for all decoder time steps i and encoder time steps j,
αᵢⱼ = f(hⱼ, sᵢ₋₁).
For example,
α₂₁ = f(h₁, s₁),
α₂₂ = f(h₂, s₁),
α₂₃ = f(h₃, s₁).
So, if we want to calculate α₂₃, we simply apply the same function f to h₃ and s₁. This naturally leads to the question: what exactly is this function f?
Researchers introduced a simple but powerful idea here. Instead of manually designing a mathematical function, they leveraged the fact that artificial neural networks (ANNs) are universal function approximators. Given sufficient data, a neural network can approximate almost any function. Therefore, instead of explicitly defining f, we use a small neural network to learn this function automatically.
To compute α₂₁, we feed h₁ and s₁ as inputs to a neural network, and the network outputs the alignment score α₂₁. Hence, α₂₁ = f(h₁, s₁), where f is implemented as a neural network with its own weights and biases. In practice, the raw scores produced for one decoder time step are passed through a softmax across all encoder positions, so the resulting α values are positive and sum to 1. This neural network is trained jointly with the rest of the model: during training, all parameters of the encoder, decoder, and attention network are updated together using backpropagation. This end-to-end training is one of the strongest aspects of the attention-based architecture.
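As a rough sketch of such an alignment network (loosely following the additive form used in Bahdanau-style attention; the layer sizes and names are assumptions, not the only choice), f can be a tiny feed-forward network that maps a pair (hⱼ, sᵢ₋₁) to a single score:

```python
import torch
import torch.nn as nn

class AlignmentScorer(nn.Module):
    """Small network playing the role of f(h_j, s_{i-1}).
    The additive form and the sizes are illustrative, not prescriptive."""
    def __init__(self, enc_dim=8, dec_dim=8, attn_dim=16):
        super().__init__()
        self.W_h = nn.Linear(enc_dim, attn_dim, bias=False)
        self.W_s = nn.Linear(dec_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, H, s_prev):
        # H: (T_enc, enc_dim) encoder states; s_prev: (dec_dim,) previous decoder state
        e = self.v(torch.tanh(self.W_h(H) + self.W_s(s_prev))).squeeze(-1)  # raw scores
        return torch.softmax(e, dim=0)   # alpha_ij for this decoder step, sums to 1

scorer = AlignmentScorer()
H = torch.randn(4, 8)        # encoder states h1..h4
s1 = torch.randn(8)          # previous decoder state s_1
alpha_2 = scorer(H, s1)      # alpha_21 .. alpha_24
c2 = alpha_2 @ H             # context vector c_2
print(alpha_2, c2.shape)
```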
Let us summarize this with an example.
Assume the decoder has already generated the word “light” at the first time step, and now we are at the second decoder time step. If we want to generate the Hindi word “band,” we need to know how much importance to assign to each encoder word: “turn,” “off,” “the,” and “light.” This importance is captured by the alignment scores α₂₁, α₂₂, α₂₃, and α₂₄.
To compute these values, we use the same neural network repeatedly. We input (s₁, h₁) to get α₂₁, then (s₁, h₂) to get α₂₂, then (s₁, h₃) to get α₂₃, and so on. Once all α values are computed, we perform a weighted sum to form the context vector:
c₂ = α₂₁h₁ + α₂₂h₂ + α₂₃h₃ + α₂₄h₄.
This context vector c₂ is then passed to the decoder at the second time step along with s₁ and y₁. Using these three inputs, the decoder generates the next output word (“band”) and computes the next decoder hidden state s₂. This process continues for subsequent time steps.
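The whole step can be strung together in a few lines. The sketch below uses invented components (a simple feed-forward scorer, an nn.LSTMCell decoder, and toy sizes) purely to show how y₁, s₁, and c₂ jointly drive the computation of s₂ and the scores for the next output word:

```python
import torch
import torch.nn as nn

emb_dim, hid_dim, vocab = 8, 8, 1000

# Pieces of a toy attention decoder (all names and sizes are illustrative).
embed = nn.Embedding(vocab, emb_dim)
attn = nn.Sequential(nn.Linear(2 * hid_dim, 16), nn.Tanh(), nn.Linear(16, 1))
decoder_cell = nn.LSTMCell(emb_dim + hid_dim, hid_dim)
out = nn.Linear(hid_dim, vocab)

H = torch.randn(4, hid_dim)                 # encoder states h1..h4
s_prev = torch.randn(1, hid_dim)            # previous decoder state s_1
cell_prev = torch.zeros(1, hid_dim)         # previous LSTM cell state
y_prev = torch.tensor([42])                 # previous output word y_1 (a token id)

# 1) Score each encoder state against s_1 and normalise -> alpha_21..alpha_24.
e = attn(torch.cat([H, s_prev.expand(4, -1)], dim=1)).squeeze(-1)
alpha = torch.softmax(e, dim=0)

# 2) Context vector c_2 as the weighted sum of encoder states.
c2 = alpha @ H

# 3) One decoder step driven by the triplet [y_1, s_1, c_2].
step_in = torch.cat([embed(y_prev), c2.unsqueeze(0)], dim=1)
s2, cell2 = decoder_cell(step_in, (s_prev, cell_prev))
next_word_scores = out(s2)                  # distribution over the target vocabulary
print(alpha, next_word_scores.shape)
```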
This entire procedure captures the core idea behind the attention mechanism: dynamically focusing on different parts of the input sequence at different decoding steps to generate more accurate and meaningful outputs.
BENEFITS OF APPLYING THE ATTENTION MECHANISM

From the graph, we can observe that the x-axis represents sentence length, while the y-axis represents the BLEU score, which is a standard metric used to evaluate translation quality. In this comparison, four different models are evaluated across increasing sentence lengths. As the sentence length crosses around 30 words, the performance of the vanilla encoder–decoder models starts to degrade. In contrast, the attention-based model maintains a higher BLEU score even for longer sentences. This clearly demonstrates the advantage of using attention, especially when handling long input sequences.
Another important benefit of the attention mechanism is interpretability. Since we explicitly compute attention weights (the α values), we can visualize and analyze them. For example, if we have four input words and four output words, we obtain sixteen α values in total. For a particular output word, we can examine the corresponding α values associated with all input hidden states. By plotting these values, we can see which input words the model focused on while generating a specific output word. This makes it easier to understand and verify how the model is performing the translation.
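For instance, a heatmap of the α matrix (rows for output words, columns for input words) can be drawn with a few lines of matplotlib. The weights and the completed Hindi output below are made up purely for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

src = ["turn", "off", "the", "light"]        # input words
tgt = ["light", "band", "karo", "<eos>"]     # output words (illustrative only)

# A made-up 4 x 4 matrix of attention weights alpha_ij (each row sums to 1).
alpha = np.array([[0.05, 0.05, 0.10, 0.80],
                  [0.45, 0.40, 0.05, 0.10],
                  [0.30, 0.40, 0.15, 0.15],
                  [0.25, 0.25, 0.25, 0.25]])

plt.imshow(alpha, cmap="viridis")
plt.xticks(range(len(src)), src)
plt.yticks(range(len(tgt)), tgt)
plt.xlabel("input (encoder) words")
plt.ylabel("output (decoder) words")
plt.colorbar(label="attention weight")
plt.show()
```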
The fact that attention weights can be visualized also provides strong empirical evidence that the mechanism is actually learning meaningful alignments between input and output sequences. This confirms that the attention mechanism is not only effective but also reliable in practice.
In the original attention-based sequence-to-sequence work (Bahdanau et al., 2014), the researchers further improved performance by using a bidirectional RNN in the encoder. This allowed the model to capture both past and future context in the input sequence, which, when combined with attention, led to even better translation quality.


