Bahdanau vs. Luong Attention: Architecture, Math, and Differences Explained
- Aryan

Bahdanau Attention
To compute attention, we first need the alpha (α) values, the attention weights obtained by normalizing the alignment scores. These values quantify, at a specific decoder time step, how much influence each encoder hidden state has on the current decoder output. In other words, α tells us the contribution of each encoder hidden state when predicting the current token.
The context vector at decoder time step i is defined as:
cᵢ = ∑ⱼ αᵢⱼ hⱼ
Here, each αᵢⱼ measures how strongly the decoder at time step i attends to the encoder hidden state hⱼ. Thus, the alpha values collectively determine how all encoder hidden states contribute to the current output.
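As a minimal sketch with made-up numbers (toy 4-dimensional vectors and four encoder steps), the context vector is simply a weighted average of the encoder hidden states:

```python
import numpy as np

# Toy encoder hidden states h1..h4, each 4-dimensional (values are made up).
h = np.array([[0.1, 0.2, 0.3, 0.4],
              [0.5, 0.1, 0.0, 0.2],
              [0.3, 0.3, 0.3, 0.3],
              [0.0, 0.4, 0.1, 0.6]])      # shape (4, 4): one row per encoder step

# Hypothetical attention weights for decoder step i (already normalized, sum to 1).
alpha_i = np.array([0.1, 0.6, 0.2, 0.1])

# c_i = sum_j alpha_ij * h_j  ->  a single 4-dimensional context vector
c_i = alpha_i @ h
print(c_i)
```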
For the first decoder time step, the alignment score depends on the first encoder hidden state and the decoder’s previous hidden state:
e₁₁ = f(h₁, s₀)
where s₀ is the initial decoder hidden state.
At the second decoder time step, if we want to compute the alignment with the first encoder hidden state, it depends on the updated decoder state:
e₂₁ = f(h₁, s₁)
This dependency on s₁ is important because attention is not computed in isolation. The decoder state already encodes what has been generated so far, so the current alignment reflects both the specific encoder hidden state and the translation context accumulated up to that point.
In general form, the alignment score is written as:
eᵢⱼ = f(hⱼ, sᵢ₋₁)
Thus, alignment is a mathematical function of two components:
- the encoder hidden state hⱼ
- the decoder hidden state from the previous time step, sᵢ₋₁
The function f(·) produces a raw alignment score eᵢⱼ. A softmax function is then applied across all encoder positions to normalize these scores into valid attention weights αᵢⱼ.
This is where Bahdanau attention introduces its key idea. Instead of using a fixed similarity measure, it learns the alignment function itself. Since feed-forward neural networks are universal function approximators, Bahdanau attention uses a small neural network that takes hⱼ and sᵢ₋₁ as inputs and computes eᵢⱼ. Applying softmax over eᵢⱼ yields the attention weights αᵢⱼ.
These weights are then used to compute the context vector cᵢ, which directly influences the decoder’s output at time step i.

Architecture

The encoder remains unchanged. The modification is introduced in the decoder, where a feed-forward neural network (FFNN) is added to compute attention. This network is part of the decoder and consists of an input layer, a single hidden layer with three neurons, and an output layer. Its role is to learn the alignment between encoder and decoder states.
Assume the encoder produces hidden states h₁, h₂, h₃, h₄, each a 4-dimensional vector. Similarly, the decoder hidden states are s₀, s₁, s₂, s₃, s₄, also 4-dimensional. The input sentence is first passed through the encoder token by token to obtain all encoder hidden states. Once encoding is complete, the decoder starts generating the translation.
At decoder time step 1, we compute the context vector c₁, defined as:
c₁ = ∑ⱼ α₁ⱼ hⱼ
c₁ = α₁₁h₁ + α₁₂h₂ + α₁₃h₃ + α₁₄h₄
The attention weights α₁ⱼ depend on two quantities: the previous decoder hidden state s₀ and the encoder hidden state hⱼ being attended to.
To compute these weights, we concatenate s₀ with each encoder hidden state hⱼ. Since both are 4-dimensional, each concatenated vector [s₀, hⱼ] has dimension 8. Stacking these vectors gives a matrix of shape 4 × 8.

Each row of this matrix is passed to the FFNN input layer. The input-to-hidden weight matrix has size 8 × 3, and the hidden-to-output weight matrix has size 3 × 1. Processing the batch:
(4 × 8) · (8 × 3) → (4 × 3)
After adding bias and applying a non-linearity (typically tanh), we multiply by the output weights:
(4 × 3) · (3 × 1) → (4 × 1)
This produces the raw alignment scores:
e₁₁, e₁₂, e₁₃, e₁₄
These scores are not normalized, so we apply softmax to obtain the attention weights:
α₁₁, α₁₂, α₁₃, α₁₄
Using these weights, we compute the context vector c₁. The decoder then takes s₀, the previous output token (the start-of-sequence token at this first step), and c₁ to produce the first output y₁ and update its hidden state to s₁.
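Here is a minimal NumPy sketch of this first decoder step, using the illustrative sizes from the walkthrough (four encoder states, 4-dimensional vectors, a hidden layer of three neurons); the weights are random placeholders, not a trained model:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)

h = rng.normal(size=(4, 4))    # encoder hidden states h1..h4 (4-dimensional each)
s0 = rng.normal(size=4)        # initial decoder hidden state

# Alignment FFNN: input layer (8) -> hidden layer (3) -> output (1)
W1 = rng.normal(size=(8, 3))   # input-to-hidden weights
b1 = np.zeros(3)
v = rng.normal(size=(3, 1))    # hidden-to-output weights

# Concatenate s0 with every encoder state: shape (4, 8)
concat = np.concatenate([np.tile(s0, (4, 1)), h], axis=1)

# (4 × 8) · (8 × 3) -> (4 × 3), then (4 × 3) · (3 × 1) -> (4 × 1): raw scores e_1j
e1 = (np.tanh(concat @ W1 + b1) @ v).ravel()

alpha1 = softmax(e1)           # attention weights α_11..α_14
c1 = alpha1 @ h                # context vector for decoder time step 1
print(alpha1, c1)
```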
At decoder time step 2, the same process is repeated. The context vector is:
c₂ = α₂₁h₁ + α₂₂h₂ + α₂₃h₃ + α₂₄h₄
We concatenate s₁ with each encoder hidden state to form a 4 × 8 matrix, pass it through the same FFNN, normalize the alignment scores using softmax, and compute c₂.

Using c₂, s₁, and y₁, the decoder produces y₂ and updates its hidden state to s₂.
This procedure is repeated for all subsequent decoder time steps. The attention network weights remain the same across time steps and are updated only during backpropagation: once the decoder has generated the full output sequence, its predictions are compared with the ground truth using categorical cross-entropy loss.
The attention weights αᵢⱼ depend on both sᵢ₋₁ and hⱼ because encoder hidden states remain fixed after encoding, while decoder hidden states change at every time step. Since the same feed-forward network is reused at each decoder step, this architecture is called a time-distributed fully connected network.
Mathematical Formulation
Context vector:
cᵢ = ∑ⱼ αᵢⱼ hⱼ
Attention weights (softmax):
αᵢⱼ = exp(eᵢⱼ) / ∑ₖ exp(eᵢₖ)
where the sum in the denominator runs over all encoder positions (k = 1, …, 4 in our running example).
Alignment score:
eᵢⱼ = Vᵀ tanh(W [sᵢ₋₁, hⱼ] + b)
These three equations define Bahdanau attention, also known as additive attention, and the neural network used to compute eᵢⱼ is called the alignment model.
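As a compact sketch, these three equations can be transcribed into a single NumPy function (the function name and parameter shapes below are my own choices, not a fixed API):

```python
import numpy as np

def bahdanau_attention(s_prev, H, W, b, v):
    """One additive-attention step: returns (alpha_i, c_i).

    s_prev : (d,)     previous decoder hidden state s_{i-1}
    H      : (Tx, d)  encoder hidden states h_1..h_Tx
    W      : (2d, m)  alignment-model input weights
    b      : (m,)     alignment-model bias
    v      : (m,)     alignment-model output weights
    """
    # e_ij = v^T tanh(W [s_{i-1}, h_j] + b), computed for all j at once
    concat = np.concatenate([np.tile(s_prev, (H.shape[0], 1)), H], axis=1)
    e = np.tanh(concat @ W + b) @ v
    # alpha_ij = softmax over encoder positions
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()
    # c_i = sum_j alpha_ij h_j
    return alpha, alpha @ H
```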
Luong Attention
The objective of Luong attention is the same as Bahdanau attention: to compute attention scores at each decoder time step and identify which encoder hidden states are most relevant for predicting the current output. The difference lies in how these attention scores are computed.
In Bahdanau attention, the alignment score is a function of the previous decoder hidden state sᵢ₋₁ and the encoder hidden state hⱼ. In Luong attention, the alignment score depends on the current decoder hidden state sᵢ and the encoder hidden state hⱼ. This allows the model to use more up-to-date decoder information when computing attention.
Another key difference is the alignment function itself. Bahdanau attention uses a feed-forward neural network to learn the similarity between sᵢ₋₁ and hⱼ. In contrast, Luong attention replaces this learned function with a simpler and more efficient dot product.
If sᵢ = [a b c d] and hⱼ = [p q r s], then the alignment score is computed as:
eᵢⱼ = a·p + b·q + c·r + d·s
This value eᵢⱼ represents the attention score before normalization. Applying softmax over all encoder positions gives the attention weights αᵢⱼ.
The motivation behind this design is simplicity and efficiency. Adding a neural network increases model complexity and slows down training. The dot product, on the other hand, is a straightforward similarity measure: if two vectors are similar, their dot product is high; if they are dissimilar, the dot product is low. This makes it an effective and computationally efficient way to identify useful encoder hidden states.
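A tiny numeric example (the vectors are arbitrary placeholders) makes the score computation concrete:

```python
import numpy as np

s_i = np.array([0.2, 0.5, 0.1, 0.7])   # current decoder hidden state (toy values)
h_j = np.array([0.3, 0.4, 0.0, 0.6])   # one encoder hidden state (toy values)

# Luong (dot) alignment score: no learned parameters, just a dot product
e_ij = float(s_i @ h_j)
print(e_ij)   # 0.2*0.3 + 0.5*0.4 + 0.1*0.0 + 0.7*0.6 = 0.68
```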
By using the current decoder hidden state, Luong attention incorporates more updated contextual information, allowing the decoder to adjust its output more dynamically. Empirically, this approach has been shown to produce better results than Bahdanau attention in many settings.
In summary, Luong attention differs in two key ways:
- it uses the current decoder hidden state instead of the previous one, and
- it replaces a neural network–based alignment function with a dot product, making the attention mechanism faster and simpler.
Architecture (Luong Attention)

In Luong attention, alignment scores are computed using the dot product between the current decoder hidden state and all encoder hidden states. At decoder time step 1, we take the current decoder hidden state s₁ and compute dot products with the encoder hidden states h₁, h₂, h₃, h₄. This gives the alignment scores:
e₁₁, e₁₂, e₁₃, e₁₄
These scores are passed through a softmax function to obtain the attention weights:
α₁₁, α₁₂, α₁₃, α₁₄
Using these weights, we compute the context vector c₁ in the same way as in Bahdanau attention.
The key architectural difference lies in how this context vector is used. In Luong attention, the context vector is not fed as an input to the decoder RNN. Instead, c₁ is concatenated with the current decoder hidden state s₁ to form a combined state, commonly denoted as s̄₁. This combined representation is then passed through a feed-forward layer followed by softmax to generate the output token.
At the next decoder time step, the decoder uses s₁ and the previously generated output to produce the next hidden state s₂. We then compute dot products between s₂ and all encoder hidden states to obtain:
e₂₁, e₂₂, e₂₃, e₂₄
Applying softmax yields the new attention weights, which are used to compute the context vector c₂. The context vector c₂ is again concatenated with s₂ to form s̄₂, and the output is produced using a feed-forward layer followed by softmax. This process is repeated for all subsequent decoder time steps.
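Putting one Luong decoding step together, here is a minimal NumPy sketch with toy shapes (four encoder states, 4-dimensional vectors, a 10-token vocabulary); the weights are random placeholders and the variable names are my own:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)

H = rng.normal(size=(4, 4))      # encoder hidden states h1..h4
s_t = rng.normal(size=4)         # current decoder hidden state s_t

# Dot-product alignment scores against every encoder state, then softmax
e_t = H @ s_t                    # raw scores e_t1..e_t4
alpha_t = softmax(e_t)

# Context vector and the combined state s̄_t = [c_t ; s_t]
c_t = alpha_t @ H
s_bar = np.concatenate([c_t, s_t])          # shape (8,)

# Feed-forward layer + softmax over a toy 10-token vocabulary to pick the output
W_out = rng.normal(size=(8, 10))
p_vocab = softmax(s_bar @ W_out)
y_t = int(p_vocab.argmax())
print(alpha_t, y_t)
```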
There are two main differences compared to Bahdanau attention. First, Luong attention uses a dot product instead of a feed-forward neural network to compute alignment scores, which significantly reduces the number of parameters. Second, attention is computed using the current decoder hidden state rather than the previous one, resulting in more up-to-date contextual information during prediction.
Because it relies on dot products and has fewer parameters, Luong attention is computationally efficient and faster to train, and it has empirically been shown to perform better than Bahdanau attention in many settings. Since the alignment score is obtained by multiplying (taking the dot product of) the two hidden states rather than passing them through a learned network, Luong attention is also referred to as multiplicative attention.


