The Transformer Decoder Explained: Architecture, Math & Operations
- Aryan

- Mar 15
- 7 min read
TRANSFORMER DECODER ARCHITECTURE
Like the encoder, the decoder in the original Transformer paper consists of six stacked layers. Below is a simplified representation of the Transformer decoder architecture.

A single decoder block consists of three main sub-blocks:
Masked self-attention
Cross-attention (encoder–decoder attention)
Feed-forward neural network
These decoder blocks are stacked one on top of another. The output of the first decoder block is passed as input to the second block, the second to the third, and so on. Finally, the sixth decoder block produces the final output, which is then sent to the output layer to generate probabilities over the vocabulary.
DEEP DIVE
Let’s understand the decoder through a machine translation example, specifically English-to-Hindi translation, viewed from the perspective of how the decoder behaves during training.
Suppose the English sentence is “we are friends”, and its corresponding Hindi translation (target output) is “hum dost hai”. During training, the English sentence is sent to the encoder, which processes it and generates contextual encoder embeddings. Only after the encoder completes its work do we start processing the decoder input.
The decoder does not directly receive the English sentence. Instead, it receives:
The encoder’s contextual output vectors
The processed target sentence (Hindi sentence) as input
Decoder Input Processing

Before sending the Hindi sentence to the first decoder block, four operations are performed on the target sentence:
Right Shifting
Tokenization
Embedding
Positional Encoding
Let’s go through these steps one by one.
We start with the target sentence: “hum dost hai”.
During training, we apply right shifting: a special token <start> is added at the beginning of the sentence, which shifts every target word one position to the right. This ensures the decoder predicts each word using only the words before it, and the <start> token acts as the signal to begin decoding.
So, the transformed input becomes:
<start> hum dost hai
Tokenization
In the second step, we tokenize the sentence by converting each word into a token. After tokenization, we obtain four tokens:
<start>, hum, dost, hai
Embedding Layer
Next, these tokens are passed through the embedding layer. The embedding layer converts each token into a dense vector representation. For example, if the embedding dimension is 512, each token is mapped to a 512-dimensional vector.
Since we have four tokens, we obtain four embedding vectors, each of size 512.
Positional Encoding
At this point, we have vector representations for each token, but these vectors do not contain any information about the order or position of the words in the sentence. To incorporate sequence information, we apply positional encoding.
Positional encoding generates a 512-dimensional vector for each position in the sequence. Since we have four tokens, we get four positional encoding vectors.
We then add the token embeddings and positional encoding vectors element-wise, resulting in the final input vectors.
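This element-wise combination can be sketched in NumPy. The sinusoidal formula follows the original Transformer paper; the random `embeddings` matrix is just a stand-in for a real, learned embedding layer:

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encodings from the original Transformer paper."""
    pos = np.arange(seq_len)[:, None]      # (seq_len, 1) positions
    i = np.arange(d_model)[None, :]        # (1, d_model) dimension indices
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    # Even dimensions use sine, odd dimensions use cosine
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

# Four tokens (<start>, hum, dost, hai), embedding dimension 512
embeddings = np.random.randn(4, 512)              # stand-in token embeddings
x = embeddings + positional_encoding(4, 512)      # element-wise addition
```

The result `x` holds the four 512-dimensional input vectors x₁, x₂, x₃, x₄.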
Final Decoder Input
After combining embeddings and positional encodings, we obtain the input vectors:
x₁, x₂, x₃, x₄
These vectors correspond to the four tokens and are now ready to be sent into the first decoder block for further processing.
Now, each decoder block performs three main operations in order: masked multi-head self-attention, then cross-attention, and finally the feed-forward neural network.

We already have the input vectors x₁, x₂, x₃, x₄. These vectors are now sent to the first decoder block.
The first operation inside the decoder is masked self-attention.

Masked multi-head self-attention works the same way as normal self-attention, except that masking is applied. It produces contextual vectors z₁, z₂, z₃, z₄, but with an important constraint: future tokens are not allowed to participate in the attention computation.
While generating z₁, position 1 attends only to itself; positions 2, 3, and 4 are masked out.
While generating z₂, positions 1 and 2 are visible, but positions 3 and 4 are masked.
For z₃, positions 1, 2, and 3 are visible, and position 4 is masked.
For z₄, all positions up to and including 4 are visible, so no masking is needed.
This masking ensures that, during training, the decoder cannot look ahead and use future words while predicting the current token.
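A minimal single-head sketch of this masking in NumPy (the weight matrices `Wq`, `Wk`, `Wv` are random stand-ins; a real decoder uses multiple heads and learned parameters):

```python
import numpy as np

def masked_self_attention(X, Wq, Wk, Wv):
    """Single-head causal self-attention: position i attends only to <= i."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n, n) similarity scores
    n = scores.shape[0]
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(mask, -np.inf, scores)          # block future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V, weights

d_model = 512
X = np.random.randn(4, d_model)                       # x1..x4
Wq = np.random.randn(d_model, d_model) / np.sqrt(d_model)
Wk = np.random.randn(d_model, d_model) / np.sqrt(d_model)
Wv = np.random.randn(d_model, d_model) / np.sqrt(d_model)
Z, A = masked_self_attention(X, Wq, Wk, Wv)
```

Row i of the attention matrix `A` is zero beyond column i, which is exactly the constraint described above.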
After masked self-attention, we apply the Add & Norm step.
Here, the output of masked self-attention (z₁, z₂, z₃, z₄) is added to the original input vectors (x₁, x₂, x₃, x₄) using a residual connection.
After this addition, we obtain:
z₁′, z₂′, z₃′, z₄′
Next comes layer normalization.
Each vector is passed through layer normalization, where the mean (μ) and variance (σ²) are computed for each vector independently. Using these values, normalization is applied, resulting in:
z₁norm, z₂norm, z₃norm, z₄norm
Layer normalization helps stabilize training and improves gradient flow.
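The Add & Norm step can be sketched as follows; note that the learnable gain and bias parameters (γ, β) used in practice are omitted here for simplicity:

```python
import numpy as np

def add_and_norm(x, sublayer_out, eps=1e-6):
    """Residual connection followed by per-vector layer normalization."""
    z = x + sublayer_out                       # Add: residual connection
    mu = z.mean(axis=-1, keepdims=True)        # mean of each vector
    var = z.var(axis=-1, keepdims=True)        # variance of each vector
    return (z - mu) / np.sqrt(var + eps)       # Norm: per-vector standardization

x = np.random.randn(4, 512)    # decoder inputs x1..x4
z = np.random.randn(4, 512)    # masked self-attention outputs z1..z4
z_norm = add_and_norm(x, z)    # z1norm..z4norm
```

Each output vector has mean ≈ 0 and standard deviation ≈ 1 along its 512 dimensions.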
These normalized vectors are now sent to the cross-attention block.

Here, an important interaction happens between the input sequence (English sentence) and the output sequence (Hindi sentence). In cross-attention, each token from the decoder attends to the encoder output to find relevant information from the source sentence.
Unlike self-attention, where only one sequence is used, cross-attention uses two sequences:
The decoder sequence (Hindi side)
The encoder output (English side)
This is done because we need three sets of vectors:
Query vectors from the decoder
Key and value vectors from the encoder
The normalized decoder outputs (z₁norm, z₂norm, z₃norm, z₄norm) are used to generate the query vectors, while the encoder embeddings are used to generate the key and value vectors.
The remaining steps are the same as in standard self-attention: similarity scores are computed, attention weights are applied, and contextual representations are produced.
As a result, we obtain contextual embeddings for the four tokens:
zc₁, zc₂, zc₃, zc₄
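A single-head sketch of cross-attention, assuming random stand-in weights and a three-vector encoder output for “we are friends”:

```python
import numpy as np

def cross_attention(dec, enc, Wq, Wk, Wv):
    """Queries come from the decoder; keys and values from the encoder output."""
    Q = dec @ Wq                                     # (4, d) from the Hindi side
    K = enc @ Wk                                     # (3, d) from the English side
    V = enc @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # (4, 3): no masking needed
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)               # row-wise softmax
    return w @ V                                     # (4, d) contextual vectors

d = 512
dec_norm = np.random.randn(4, d)   # z1norm..z4norm for "<start> hum dost hai"
enc_out = np.random.randn(3, d)    # encoder vectors for "we are friends"
Wq, Wk, Wv = (np.random.randn(d, d) / np.sqrt(d) for _ in range(3))
Zc = cross_attention(dec_norm, enc_out, Wq, Wk, Wv)
```

Note the rectangular score matrix (4 × 3): every decoder position may attend to every encoder position, so no causal mask is applied here.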
Again, we apply Add & Norm.
The cross-attention outputs are added to the previous normalized outputs from the masked self-attention block, forming another residual connection. After addition, we get:
zc₁′, zc₂′, zc₃′, zc₄′
Finally, layer normalization is applied once more, giving the normalized outputs:
zc₁norm, zc₂norm, zc₃norm, zc₄norm
These vectors are now ready to be passed into the feed-forward neural network, which is the final component of the decoder block.

Now we apply the feed-forward block.
At this stage, we have the normalized vectors zc₁norm, zc₂norm, zc₃norm, zc₄norm, and these are passed into the feed-forward neural network.
The feed-forward network has exactly the same architecture as the one used in the encoder. It consists of two fully connected layers.
The first layer has 2048 neurons with ReLU activation, and the second layer has 512 neurons with a linear activation. The network expects an input dimension of 512.
The weight matrix between the input and the first layer is W₁ ∈ ℝ⁵¹²×²⁰⁴⁸, with a bias vector b₁ ∈ ℝ²⁰⁴⁸.
Between the first and second layers, the weight matrix is W₂ ∈ ℝ²⁰⁴⁸×⁵¹², with a bias vector b₂ ∈ ℝ⁵¹².
Since we have four vectors, each of dimension 512, we can represent them as a single matrix of shape 4 × 512, treating them as a batch. This entire batch is passed through the feed-forward network.
First, the matrix multiplication is performed:
Z · W₁ + b₁
Here, the shape transformation is:
(4 × 512) · (512 × 2048) → (4 × 2048)
After adding the bias b₁, we obtain a 4 × 2048 matrix. At this step, the dimensionality of the vectors increases.
Next, we apply the ReLU activation, which introduces non-linearity and allows the network to capture more complex patterns in the data.
Then, this output is passed to the second layer:
(ReLU(Z · W₁ + b₁)) · W₂ + b₂
This results in:
(4 × 2048) · (2048 × 512) → (4 × 512)
After adding the bias b₂, we again obtain four vectors of dimension 512. Although the dimensionality returns to 512, the representations now capture non-linear transformations of the input. We denote these output vectors as y₁, y₂, y₃, y₄.
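The whole feed-forward computation is just a pair of matrix multiplications, sketched here with randomly initialized stand-in weights:

```python
import numpy as np

def feed_forward(Z, W1, b1, W2, b2):
    """Position-wise FFN: expand 512 -> 2048 with ReLU, project back to 512."""
    H = np.maximum(0, Z @ W1 + b1)   # (4, 512) -> (4, 2048), ReLU activation
    return H @ W2 + b2               # (4, 2048) -> (4, 512), linear activation

Z = np.random.randn(4, 512)          # the four normalized cross-attention vectors
W1 = np.random.randn(512, 2048) * 0.02
b1 = np.zeros(2048)
W2 = np.random.randn(2048, 512) * 0.02
b2 = np.zeros(512)
Y = feed_forward(Z, W1, b1, W2, b2)  # y1..y4, each 512-dimensional
```

The same two weight matrices are applied to every position, which is why the four vectors can be processed as one 4 × 512 batch.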
After the feed-forward network, we again apply the Add & Norm step.
A residual connection is used, where the feed-forward outputs (y₁, y₂, y₃, y₄) are added to the corresponding inputs (zc₁norm, zc₂norm, zc₃norm, zc₄norm).
After this addition, we obtain:
y₁′, y₂′, y₃′, y₄′
These vectors are then passed through layer normalization, resulting in:
y₁norm, y₂norm, y₃norm, y₄norm
All of these vectors are of 512 dimensions. These four vectors represent the final output of the first decoder block.
Since the Transformer decoder consists of six decoder blocks, the output of the first decoder block (y₁norm, y₂norm, y₃norm, y₄norm) is passed as input to the second decoder block.

The same sequence of operations is repeated in each decoder block. The only difference between blocks is that each block has its own set of parameters, while the architecture and operations remain identical.
Finally, after passing through the sixth decoder block, we obtain the final output vectors:
yf₁norm, yf₂norm, yf₃norm, yf₄norm
These vectors are the final decoder representations, which are then used by the output layer to generate the predicted tokens and produce the final translated sentence.
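The stacking itself can be sketched as a simple loop. Here `make_block` is only a stand-in (a single linear map) for a full decoder block with its three sub-layers; the point is that the six blocks share an architecture but each carries its own parameters:

```python
import numpy as np

def make_block(d, seed):
    """Stand-in for one decoder block: a per-block linear map.
    A real block contains the three sub-layers described above."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((d, d)) / np.sqrt(d)
    return lambda x, enc: x @ W      # enc would feed the cross-attention sub-layer

def run_decoder(x, enc_out, blocks):
    for block in blocks:             # output of block i is the input to block i+1
        x = block(x, enc_out)
    return x                         # final vectors yf1norm..yf4norm

blocks = [make_block(512, s) for s in range(6)]   # six blocks, separate parameters
yf = run_decoder(np.random.randn(4, 512), np.random.randn(3, 512), blocks)
```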
Now we come to the last part of the decoder architecture.
We have the final normalized vectors yf₁norm, yf₂norm, yf₃norm, yf₄norm. For each of these vectors, the decoder must produce one Hindi word.
Output Layer: Linear + Softmax

The output part consists of two layers:
A linear layer
A softmax layer
This output layer is similar in spirit to the output layer of a feed-forward neural network. Here, we have a single linear layer that takes a 512-dimensional input and projects it to a vector of size V, where V is the vocabulary size of the Hindi language.
Since we are performing English-to-Hindi translation, we build the vocabulary from the Hindi dataset. Suppose we have around 5,000 sentence pairs, and after processing the Hindi text, we find 10,000 unique Hindi words. In this case, the vocabulary size V = 10,000.
This means the output layer has:
10,000 neurons, one for each unique Hindi word
A weight matrix W₃ ∈ ℝ⁵¹²×¹⁰⁰⁰⁰
A bias vector b₃ ∈ ℝ¹⁰⁰⁰⁰
Each neuron corresponds to exactly one Hindi word in the vocabulary.
Applying the Output Layer

We have four vectors, one for each token (<start>, hum, dost, hai). Since each vector is 512-dimensional, we can stack them into a matrix of shape 4 × 512 and process them in batch form. This allows parallel computation, but for clarity, let’s understand the process for a single vector.
We first take yf₁norm, which corresponds to the <start> token, and pass it through the linear layer:
yf₁norm · W₃ + b₃ → 1 × 10,000
This produces 10,000 raw scores, one score for each Hindi word. These values are not normalized and can be any real number.
Softmax and Probability Distribution
Next, we apply the softmax function to these 10,000 values. Softmax converts these raw scores into a probability distribution, where:
All probabilities are between 0 and 1
The sum of probabilities is exactly 1
Each probability represents how likely the corresponding Hindi word is for that position.
The word whose neuron has the highest probability is selected as the output token.
For example, if the word “hum” has the highest probability, then the output corresponding to <start> is “hum”.
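The linear-plus-softmax step can be sketched directly; the vocabulary size of 10,000 is the hypothetical figure from above, and the weights are random stand-ins:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

V = 10_000                               # hypothetical Hindi vocabulary size
W3 = np.random.randn(512, V) * 0.02      # linear layer: 512 -> V
b3 = np.zeros(V)

yf = np.random.randn(4, 512)             # yf1norm..yf4norm
logits = yf @ W3 + b3                    # (4, 10000) raw scores per Hindi word
probs = softmax(logits)                  # each row is a probability distribution
predicted_ids = probs.argmax(axis=-1)    # highest-probability word per position
```

Each row of `probs` sums to 1, and `argmax` picks the neuron (Hindi word) with the highest probability for each of the four positions.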
Generating the Full Sentence
The same process is repeated for the remaining vectors:
yf₂norm → output “dost”
yf₃norm → output “hai”
yf₄norm → output <end>, at which point decoding stops
This is how the decoder generates the output sequence step by step.
Complete Decoder Flow (Summary)
So overall, the decoder works as follows:
Input processing
Right shifting
Tokenization
Embedding
Positional encoding
Decoder block operations (repeated 6 times)
Masked multi-head self-attention
Cross-attention
Feed-forward neural network
Output generation
Linear projection to vocabulary size
Softmax to obtain probabilities
Select the word with the highest probability
The selected words form the final translated sentence.
This completes the Transformer decoder architecture end to end.


