
Transformer Inference Explained: A Step-by-Step Guide to Autoregressive Decoding

  • Writer: Aryan
  • 6 days ago
  • 9 min read

TRANSFORMER INFERENCE


Now we will see how a Transformer performs inference and how it behaves during prediction. After training, when we use the model for prediction, the architecture behaves slightly differently compared to training, especially in the decoder.

Assume we have a machine translation task from English to Hindi. The dataset contains English sentences as input and corresponding Hindi sentences as output. We use a Transformer architecture and train it on a large dataset for this task. After training, we obtain the final learned parameters (weights and biases), and the model is ready for inference.

Now we provide an input sentence like “we are friends” for prediction. Let’s understand step by step how inference happens when this sentence is passed through the Transformer.

The Transformer is divided into two main parts: the encoder and the decoder. The encoder behaves the same during training and inference. The key difference lies in the decoder.

During training, the decoder receives the full target sentence, so it processes all tokens together and behaves in a non-autoregressive manner. During inference, however, the target sentence is unknown, so the decoder must generate the output one token per time step, which makes it autoregressive.

We start with the input sentence “we are friends”, and using the trained Transformer, we want to generate its Hindi translation. First, this sentence is passed to the encoder, which behaves exactly the same as during training. The steps include tokenization, token embedding, positional encoding, followed by multi-head self-attention and a feed-forward network to introduce non-linearity. Between these blocks, Add & Norm operations are applied. This entire encoder block is repeated multiple times (for example, six layers). At the end, for each input token, we obtain a contextual vector.

Now we focus on the decoder, which generates the Hindi output. Decoder behavior differs between training and inference. Since we already have the encoder’s output in the form of context vectors, the decoder now starts its inference process. During inference, the decoder works step by step, producing one output token at each time step.

To start decoding, we feed the special ⟨sos⟩ (start-of-sequence) token into the decoder. As soon as the decoder receives this token, the decoding process begins.

The ⟨sos⟩ token first goes through the embedding layer, producing a 512-dimensional embedding vector. Positional encoding is then added to this embedding. This resulting vector is the decoder input 𝒙₁, which is passed into the decoder block.
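This embedding-plus-positional-encoding step can be sketched as follows, assuming the sinusoidal encoding from the original Transformer paper and a random stand-in for the learned ⟨sos⟩ embedding:

```python
import numpy as np

def positional_encoding(position, d_model=512):
    """Sinusoidal positional encoding for a single 0-indexed position."""
    i = np.arange(d_model // 2)                      # dimension-pair index
    angles = position / np.power(10000.0, 2 * i / d_model)
    pe = np.empty(d_model)
    pe[0::2] = np.sin(angles)                        # even dimensions: sine
    pe[1::2] = np.cos(angles)                        # odd dimensions: cosine
    return pe

# Hypothetical 512-d embedding for the <sos> token (random stand-in)
sos_embedding = np.random.randn(512)
x1 = sos_embedding + positional_encoding(0)          # decoder input x1
```

For position 0 the sine terms are 0 and the cosine terms are 1, so the encoding only shifts alternating dimensions of the embedding.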

The decoder mainly consists of three components: masked multi-head self-attention, cross-attention, and a feed-forward neural network, with Add & Norm layers applied between them.

First, the input 𝒙₁ is sent to the masked multi-head self-attention block. Inside this block, there are three learnable projection matrices: 𝑾ᵩ, 𝑾ₖ, and 𝑾ᵥ. Multiplying 𝒙₁ by each of these matrices gives the query, key, and value vectors.

Next, the query and key vectors are combined with a dot product to produce a scalar score. This score is divided by √𝒅ₖ and passed through a softmax function to obtain the attention weight. Since only the ⟨sos⟩ token is present at this step, the score represents the similarity of ⟨sos⟩ with itself, and the softmax over this single score yields a weight of 1. This attention weight is then multiplied with the value vector. All vectors involved are 512-dimensional, resulting in a 512-dimensional output vector 𝒛₁.
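A minimal sketch of this single-token attention step, with random stand-ins for the learned projection matrices:

```python
import numpy as np

d_model, d_k = 512, 512                 # the text uses 512-d q/k/v vectors
rng = np.random.default_rng(0)

# Random stand-ins for the learned projection matrices W_q, W_k, W_v
W_q = rng.standard_normal((d_model, d_k)) * 0.02
W_k = rng.standard_normal((d_model, d_k)) * 0.02
W_v = rng.standard_normal((d_model, d_k)) * 0.02

x1 = rng.standard_normal(d_model)       # positionally encoded <sos> input

q, k, v = x1 @ W_q, x1 @ W_k, x1 @ W_v  # query, key, value vectors
score = (q @ k) / np.sqrt(d_k)          # scalar: <sos> similarity with itself
weight = 1.0                            # softmax over a single score is 1
z1 = weight * v                         # 512-d self-attention output
```

With one token the output is just the value vector itself; the weighting only becomes non-trivial once more tokens are present.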

After this, an Add & Norm operation is applied. The vector 𝒛₁ is added to the original input 𝒙₁ via a residual connection, producing 𝒛₁′. This is then passed through layer normalization, giving us the normalized vector 𝒛₁ⁿᵒʳᵐ.
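The Add & Norm step is a residual addition followed by layer normalization; a minimal sketch (the learnable gain and bias of layer norm are omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (gain/bias omitted)."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

rng = np.random.default_rng(0)
x1 = rng.standard_normal(512)    # decoder input (stand-in)
z1 = rng.standard_normal(512)    # masked self-attention output (stand-in)

z1_prime = z1 + x1               # Add: residual connection
z1_norm = layer_norm(z1_prime)   # Norm: layer normalization
```

The residual path lets gradients and the original signal flow around the attention block; normalization keeps activations in a stable range.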

Now, cross-attention is performed. The vector 𝒛₁ⁿᵒʳᵐ is passed to the cross-attention block, which computes attention between two sequences: the decoder sequence (currently ⟨sos⟩) and the encoder output sequence (“we are friends”). The goal here is to measure how similar the decoder token is to each encoder token’s contextual representation.

In cross-attention, the query vector is derived from the decoder input, while the key and value vectors come from the encoder outputs. The vector 𝒛₁ⁿᵒʳᵐ is multiplied with 𝑾ᵩ to obtain the query vector for ⟨sos⟩. The encoder context vectors are multiplied with 𝑾ₖ and 𝑾ᵥ to produce key and value vectors for “we”, “are”, and “friends”.

We then compute dot products between the single query vector and each of the three key vectors, resulting in three scalar scores 𝒘₁, 𝒘₂, 𝒘₃. These scores are divided by √𝒅ₖ and passed through a softmax function to obtain attention weights. The final cross-attention output is computed as:

𝒘₁ · 𝒗(we) + 𝒘₂ · 𝒗(are) + 𝒘₃ · 𝒗(friends)

This produces a new vector, called the cross-attention output vector, denoted as 𝒛𝒄₁.
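The cross-attention computation above can be sketched as follows; the encoder context vectors and projection matrices are random stand-ins for the values a trained model would hold:

```python
import numpy as np

d_model = 512
rng = np.random.default_rng(1)

z1_norm = rng.standard_normal(d_model)        # decoder vector for <sos>
enc_out = rng.standard_normal((3, d_model))   # context vectors: "we", "are", "friends"

W_q = rng.standard_normal((d_model, d_model)) * 0.02
W_k = rng.standard_normal((d_model, d_model)) * 0.02
W_v = rng.standard_normal((d_model, d_model)) * 0.02

q = z1_norm @ W_q                             # one query (decoder side)
K = enc_out @ W_k                             # three keys (encoder side)
V = enc_out @ W_v                             # three values (encoder side)

scores = (K @ q) / np.sqrt(d_model)           # w1, w2, w3 before softmax
weights = np.exp(scores - scores.max())
weights /= weights.sum()                      # softmax -> attention weights
zc1 = weights @ V                             # w1*v(we) + w2*v(are) + w3*v(friends)
```

The only asymmetry versus self-attention is the source of the projections: queries come from the decoder sequence, keys and values from the encoder output.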

Next, another Add & Norm operation is applied. The vector 𝒛𝒄₁ is added to 𝒛₁ⁿᵒʳᵐ using a residual connection to form 𝒛𝒄₁′. Finally, layer normalization is applied to obtain the normalized vector 𝒛𝒄₁ⁿᵒʳᵐ.

The next layer is the feed-forward block. This block consists of a position-wise feed-forward neural network that applies a non-linear transformation to the input vector, followed by an Add & Norm operation.

At this stage, we take the vector 𝒛𝒄₁ⁿᵒʳᵐ and pass it into the feed-forward neural network. This network expects a 512-dimensional input vector. It has a hidden layer with 2048 neurons using the ReLU activation function, followed by an output layer with 512 neurons and a linear activation.

The input vector has shape (1, 512). It is first multiplied with a weight matrix of shape (512, 2048), producing a transformed vector of shape (1, 2048). After adding the bias term, the dimension remains (1, 2048). The ReLU activation is then applied to introduce non-linearity. Next, this vector is multiplied with another weight matrix of shape (2048, 512), resulting in an output vector of shape (1, 512). This final output from the feed-forward network is denoted as 𝒚₁. The purpose of this block is to capture non-linear relationships that cannot be modeled by attention alone.
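The shape bookkeeping in this block can be sketched directly; the weights below are random stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(2)
W1, b1 = rng.standard_normal((512, 2048)) * 0.02, np.zeros(2048)  # layer 1
W2, b2 = rng.standard_normal((2048, 512)) * 0.02, np.zeros(512)   # layer 2

zc1_norm = rng.standard_normal((1, 512))      # input, shape (1, 512)
hidden = np.maximum(0, zc1_norm @ W1 + b1)    # (1, 2048) after ReLU
y1 = hidden @ W2 + b2                         # (1, 512), linear output
```

Because the same two weight matrices are applied at every position independently, this is called a position-wise feed-forward network.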

After the feed-forward transformation, we again apply Add & Norm. A residual connection is used where 𝒛𝒄₁ⁿᵒʳᵐ is added to 𝒚₁, producing 𝒚₁′. This vector is then passed through layer normalization, resulting in the normalized output 𝒚₁ⁿᵒʳᵐ.

This output is then forwarded to the next decoder layer. Since the Transformer has six decoder layers, the same sequence of operations—masked self-attention, cross-attention, feed-forward network, and Add & Norm—is repeated across the remaining five decoder layers. The structure remains the same, but each decoder layer has its own set of learned parameters. After passing through the sixth decoder layer, we obtain a final 512-dimensional output vector, denoted as 𝒚𝒇₁ⁿᵒʳᵐ.

Now we move to the output generation step. The vector 𝒚𝒇₁ⁿᵒʳᵐ is passed through a linear layer, followed by a softmax function. The linear layer maps the 512-dimensional input to a vector of size 𝑽, where 𝑽 is the vocabulary size. This is achieved using a weight matrix of shape (512, 𝑽).

The softmax function converts these values into probabilities over all Hindi words in the vocabulary. Each probability represents how likely a particular word is to be the correct output at this time step. The word with the highest probability is selected as the predicted output token.
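The projection and greedy selection can be sketched as follows; the vocabulary size and output weights are illustrative stand-ins, not values from a real model:

```python
import numpy as np

V = 30000                                     # hypothetical vocabulary size
rng = np.random.default_rng(3)
W_out = rng.standard_normal((512, V)) * 0.02  # linear layer, shape (512, V)

yf1_norm = rng.standard_normal(512)           # final decoder output vector
logits = yf1_norm @ W_out                     # (V,) unnormalized scores
probs = np.exp(logits - logits.max())
probs /= probs.sum()                          # softmax over the vocabulary
predicted_id = int(np.argmax(probs))          # greedy pick: highest probability
```

Subtracting the maximum logit before exponentiating is the standard numerically stable way to compute softmax; it does not change the resulting probabilities.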

As the decoder operates in an autoregressive manner, during the first time step it produces the Hindi word “Hum” as the output. This predicted word is then fed back into the decoder in the next time step to generate the subsequent word.

 

Now we move to the second timestep. At this point, the input to the decoder consists of the tokens generated so far: ⟨sos⟩ and the Hindi word “Hum”, which was produced in the previous timestep. Since the decoder works in an autoregressive manner, it uses all previously generated tokens as input for the next prediction.

At the second timestep, the decoder input is ⟨sos⟩ and Hum. The overall process remains the same as before, with the only difference being that we now process two tokens instead of one. We send the ⟨sos⟩ and Hum tokens into the decoder input block. The ⟨sos⟩ embedding is already known from the previous step, and now the embedding for “Hum” is computed. After this, positional encoding is applied to both embeddings. As a result, we obtain two positionally encoded input vectors, 𝒙₁ and 𝒙₂.

Next, we pass these vectors 𝒙₁ and 𝒙₂ to the masked multi-head self-attention block. Since this is an attention layer, it contains the projection matrices 𝑾ᵩ, 𝑾ₖ, and 𝑾ᵥ. For each token, we compute the corresponding query, key, and value vectors. This gives us query, key, and value vectors for both 𝒙₁ (⟨sos⟩) and 𝒙₂ (Hum).

We then compute dot products between queries and keys:

𝒒₁ with 𝒌₁ and 𝒌₂, and 𝒒₂ with 𝒌₁ and 𝒌₂.

This results in four scalar values, representing similarities such as ⟨sos⟩ with ⟨sos⟩, ⟨sos⟩ with Hum, Hum with ⟨sos⟩, and Hum with Hum. These values capture how much each token attends to every other token in the sequence.

After computing these dot products, we divide them by √𝒅ₖ. Before applying the softmax function, we apply masking, which ensures that a token cannot attend to future tokens. Specifically, the ⟨sos⟩ token is not allowed to attend to the Hum token, while the Hum token is allowed to attend to both ⟨sos⟩ and itself.

Because of masking, when we compute the attention output for ⟨sos⟩, its weight on the Hum token is forced to zero, so after softmax the output uses only its own value vector:

1.0 · 𝒗₁ + 0 · 𝒗₂,

meaning there is no contribution from the Hum token. In contrast, when computing the attention output for Hum, both tokens contribute, for example

0.4 · 𝒗₁ + 0.6 · 𝒗₂.

This reflects the fact that Hum can attend to both ⟨sos⟩ and itself. Note that softmax is applied per row, so the weights in each output always sum to 1.

A natural question arises: since “Hum” has already been predicted, why do we still use masking during inference? The reason is consistency with training. The model’s weights were learned under the constraint that each position attends only to itself and earlier positions; removing the mask at prediction time would let earlier positions see later tokens, changing their representations and creating a train–inference mismatch that degrades prediction quality. Therefore, the same masking strategy used during training is also applied during inference. As the sequence grows, masking ensures that each token can only attend to itself and all previous tokens, never future ones.
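The causal mask described above can be sketched as an additive matrix of 0s and −∞ applied before the softmax (the scores below are made-up illustrative numbers):

```python
import numpy as np

def causal_mask(seq_len):
    """Additive mask: 0 where attention is allowed, -inf on future positions."""
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

scores = np.array([[0.9, 0.4],     # <sos> vs <sos>, <sos> vs Hum
                   [0.2, 1.1]])    # Hum  vs <sos>, Hum  vs Hum
masked = scores + causal_mask(2)   # <sos> can no longer see Hum
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
# Row for <sos> becomes [1.0, 0.0]; row for Hum attends to both tokens.
```

Adding −∞ before the softmax drives those entries to exactly zero weight, which is cleaner than zeroing weights after normalization.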

After masked self-attention, we obtain two output vectors, 𝒛₁ and 𝒛₂, corresponding to ⟨sos⟩ and Hum. This completes the masked self-attention block.

Next, we apply Add & Norm. Using residual connections, we add 𝒙₁ to 𝒛₁ and 𝒙₂ to 𝒛₂, resulting in 𝒛₁′ and 𝒛₂′. Finally, layer normalization is applied to these vectors, producing the normalized outputs 𝒛₁ⁿᵒʳᵐ and 𝒛₂ⁿᵒʳᵐ.

These normalized vectors are then passed to the next decoder components, starting with cross-attention.

Next, we apply cross-attention. In this block, we compute attention scores between two different sequences: the decoder sequence and the encoder output sequence. The encoder outputs are provided to the cross-attention block along with the two decoder vectors obtained from the masked self-attention step.

At this stage, we have the decoder vectors 𝒛₁ⁿᵒʳᵐ and 𝒛₂ⁿᵒʳᵐ, and the encoder output vectors corresponding to the tokens “we”, “are”, and “friends”. From the decoder side, we compute query vectors by multiplying 𝒛₁ⁿᵒʳᵐ and 𝒛₂ⁿᵒʳᵐ with the matrix 𝑾ᵩ, resulting in two query vectors—one for ⟨sos⟩ and one for Hum. From the encoder side, we compute key and value vectors by multiplying the encoder outputs with 𝑾ₖ and 𝑾ᵥ. This produces key and value vectors for each encoder token.

Next, we compute dot products between the query and key vectors. Since we have two queries and three keys, we obtain six similarity scores:

𝒘₁₁, 𝒘₁₂, 𝒘₁₃ for the first query, and

𝒘₂₁, 𝒘₂₂, 𝒘₂₃ for the second query.

These scores represent how strongly each decoder token attends to each encoder token.

These similarity scores are divided by √𝒅ₖ and passed through a softmax function, applied row-wise over each query’s three scores, to obtain normalized attention weights. Using these weights, we compute the cross-attention output vectors as weighted sums of the value vectors:

𝒛𝒄₁ = 𝒘₁₁ · 𝒗(we) + 𝒘₁₂ · 𝒗(are) + 𝒘₁₃ · 𝒗(friends)

𝒛𝒄₂ = 𝒘₂₁ · 𝒗(we) + 𝒘₂₂ · 𝒗(are) + 𝒘₂₃ · 𝒗(friends)

From this, we obtain two cross-attention output vectors 𝒛𝒄₁ and 𝒛𝒄₂. We then apply Add & Norm. Using residual connections, we add 𝒛₁ⁿᵒʳᵐ to 𝒛𝒄₁ and 𝒛₂ⁿᵒʳᵐ to 𝒛𝒄₂, producing 𝒛𝒄₁′ and 𝒛𝒄₂′. These vectors are passed through layer normalization to obtain 𝒛𝒄₁ⁿᵒʳᵐ and 𝒛𝒄₂ⁿᵒʳᵐ.

Next, we move to the feed-forward layer. The vectors 𝒛𝒄₁ⁿᵒʳᵐ and 𝒛𝒄₂ⁿᵒʳᵐ are passed through a position-wise feed-forward network consisting of two linear layers: the first with 2048 neurons and ReLU activation, and the second with 512 neurons and linear activation. After matrix multiplications and activation, we obtain the vectors 𝒚₁ and 𝒚₂.

Once again, we apply Add & Norm. The vectors 𝒚₁ and 𝒚₂ are added to 𝒛𝒄₁ⁿᵒʳᵐ and 𝒛𝒄₂ⁿᵒʳᵐ respectively using residual connections, and then layer normalization is applied to produce 𝒚₁ⁿᵒʳᵐ and 𝒚₂ⁿᵒʳᵐ.

The Transformer has six decoder layers, so this same process—masked self-attention, cross-attention, feed-forward network, and Add & Norm—is repeated for the remaining decoder layers. The architecture remains identical, but each layer has its own parameters. After passing through the final decoder layer, we obtain 𝒚𝒇₁ⁿᵒʳᵐ and 𝒚𝒇₂ⁿᵒʳᵐ.

Now we move to the output generation step. At each decoder timestep, we want to generate only one output token. Even though we have two output vectors, we do not use both. The vector corresponding to ⟨sos⟩ is ignored because its output was already used to generate the previous token (Hum). We only use the vector corresponding to Hum.

This vector is passed through a linear layer followed by softmax, producing a probability distribution over the Hindi vocabulary. The word with the highest probability is selected as the output. In this timestep, the model predicts the Hindi word “Dost”, which becomes the output.

In the next timestep, the decoder input becomes ⟨sos⟩, Hum, Dost. The same process repeats, but now with three tokens and three vectors. Again, only the last vector is used for output generation. This results in the next predicted word, “Hai”.

In the final timestep, the input becomes ⟨sos⟩, Hum, Dost, Hai. After repeating the same decoder process, the model predicts the ⟨eos⟩ (end-of-sequence) token. Once ⟨eos⟩ is generated, inference stops, and the final output sentence is:

“Hum Dost Hai”

This is how Transformer inference works in practice. The decoder operates in a non-autoregressive manner during training, but becomes autoregressive during inference, predicting one token at a time. The encoder behavior remains the same in both phases. At every timestep, the number of input tokens increases, and masking is applied during inference as well, just as in training, to ensure consistency and stable predictions.
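Putting the timesteps together, the autoregressive loop can be sketched as follows; `decoder_step` is a hypothetical stand-in for one complete decoder forward pass (embedding, six decoder layers, linear + softmax), returning probabilities for the last position only:

```python
def greedy_decode(encoder_output, decoder_step, sos_id, eos_id, max_len=50):
    """Generate tokens one at a time until <eos> (or max_len) is reached.

    decoder_step(tokens, encoder_output) is assumed to return a probability
    distribution over the vocabulary for the *last* token position only.
    """
    tokens = [sos_id]
    for _ in range(max_len):
        probs = decoder_step(tokens, encoder_output)
        next_id = max(range(len(probs)), key=probs.__getitem__)  # argmax
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]                    # drop <sos>, keep generated tokens

# Toy stand-in: predicts token 2 three times, then <eos> (id 1)
def toy_step(tokens, enc):
    return [0.0, 1.0, 0.0] if len(tokens) > 3 else [0.0, 0.0, 1.0]

print(greedy_decode(None, toy_step, sos_id=0, eos_id=1))  # → [2, 2, 2]
```

This is greedy decoding, as described in the walkthrough above; in practice beam search is often used instead to keep several candidate translations alive at each step.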

