Masked Self Attention Explained: Why Transformers Are Autoregressive Only at Inference
- Aryan

AUTOREGRESSIVE MODELS
The Transformer decoder behaves differently during inference and training. During inference, the decoder is autoregressive, whereas during training it is non-autoregressive. Inference simply means prediction. When we generate outputs using a Transformer, the decoder produces tokens one by one, each conditioned on previously generated tokens. However, during training, the decoder does not strictly depend on its own past predictions.
The idea of autoregression originally comes from economics and time-series analysis. In time-series models, an autoregressive model predicts the next value using previous values. In deep learning, an autoregressive model follows the same principle: it generates a sequence where each new data point is conditioned on the points generated earlier.
Consider a simple stock-price prediction example. Suppose a model predicts the stock price as 29 on Wednesday and 30 on Thursday. To predict Friday’s price, the model uses the values from Wednesday and Thursday. The prediction for Friday depends on previous outputs, which makes the model autoregressive.
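The stock-price example can be sketched in a few lines. This is a toy illustration, not a real forecasting model; the averaging weights are made up purely to show that the next prediction is a function of the model's own previous outputs.

```python
# Toy autoregressive predictor: the next value is computed from the two
# previous outputs. The weights (0.5, 0.5) are arbitrary for illustration.
def predict_next(prev_values, weights=(0.5, 0.5)):
    """Predict the next point as a weighted sum of the last two outputs."""
    return weights[0] * prev_values[-2] + weights[1] * prev_values[-1]

prices = [29.0, 30.0]          # Wednesday, Thursday (model outputs)
friday = predict_next(prices)  # Friday depends on previous outputs: 29.5
prices.append(friday)          # Friday's prediction feeds future steps
```

Each new prediction is appended to the history and consumed by later steps, which is exactly the autoregressive feedback loop described above.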
The same idea appears in early encoder–decoder architectures based on LSTMs, such as the original sequence-to-sequence model proposed by Ilya Sutskever and others. In machine translation, a sentence in the source language is passed to the encoder word by word. The LSTM processes these tokens sequentially and, at the end, produces hidden and cell states (hₜ, cₜ), often referred to as the context vector, which summarizes the entire input sentence.
This context vector is then passed to the decoder along with a special <start> token. Based on these inputs, the decoder LSTM generates the first output word. At the next time step, the decoder receives the previously generated word along with the updated hidden state and produces the next word. This process continues step by step until the <end> token is generated and the sequence stops.
A key observation here is that, at any time step, the decoder needs the output from the previous time step to generate the next token. Because each prediction depends on the earlier one, such models are autoregressive. In NLP tasks like language modeling, machine translation, text generation, and summarization, traditional sequence-to-sequence models are autoregressive because they generate tokens sequentially, one step at a time.
These sequence-to-sequence models cannot generate the entire output in a single step or fully in parallel. The reason is fundamental to sequential data: the current output depends on previous outputs. Since context flows forward in time, we need autoregressive modeling to capture these dependencies. This is why the Transformer decoder is also an autoregressive model.
However, there is an important distinction. The Transformer decoder is autoregressive during inference, but not during training. Ideally, one might expect the model to behave the same way in both phases, yet it does not. The reason for this difference lies in a crucial mechanism: masked self-attention.
TRANSFORMER AS AN AUTOREGRESSIVE MODEL
Now let us understand why the Transformer decoder behaves as an autoregressive model during inference but not during training. To make this clear, assume for a moment that the Transformer decoder is autoregressive in both training and inference. We will analyze what happens using a simple machine translation example.
Suppose our task is English-to-Hindi machine translation, and we are using a Transformer model. Consider the following examples from the dataset:
| # | English | Hindi |
| --- | --- | --- |
| 1 | How are you? | आप कैसे हैं |
| 2 | Congratulations | बधाई हो |
| 3 | Thank you | धन्यवाद |
Assume the Transformer has already been trained on a large dataset and is now ready for inference.
During inference, the model receives an English sentence, for example, “I am fine”, and needs to translate it into Hindi. The input sentence is passed through the encoder, where embeddings are generated, positional encoding is applied, and all encoder layers are processed. At the end of the encoder, we obtain a sequence of vectors—one vector for each input word. These vectors are then passed to the decoder.
The decoder works autoregressively during inference. At the first time step, we provide the <start> token as input. Based on this token and the encoder outputs, the decoder predicts the first Hindi word, say मैं. At the second time step, the decoder takes the encoder outputs and the previously generated word मैं as input, but suppose it makes a mistake and predicts घटिया instead of बढ़िया. At the third time step, the decoder uses the encoder outputs and the previous prediction घटिया, and generates हूं, which is actually correct. Finally, at the fourth time step, the decoder produces the <end> token, and translation stops. The final output becomes मैं घटिया हूं instead of the correct translation मैं बढ़िया हूं. This step-by-step process illustrates how inference works and why it is inherently autoregressive.
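The inference loop above can be sketched as a greedy decoding routine. Here `decoder_step` is a hypothetical stand-in for a trained decoder that returns the next token given the encoder outputs and the tokens generated so far; the `script` dictionary simply replays the faulty translation from the example.

```python
# Sketch of greedy autoregressive decoding. `decoder_step` is a stand-in
# for a trained decoder: it maps (encoder outputs, generated prefix) to
# the next token.
def decode(encoder_outputs, decoder_step, max_len=10):
    tokens = ["<start>"]
    for _ in range(max_len):
        # Each step depends on the model's OWN previous outputs.
        next_tok = decoder_step(encoder_outputs, tokens)
        if next_tok == "<end>":
            break
        tokens.append(next_tok)
    return tokens[1:]

# A fake decoder that replays the faulty translation from the example:
# the mistake at step 2 (घटिया) is fed back in and shapes later steps.
script = {1: "मैं", 2: "घटिया", 3: "हूं", 4: "<end>"}
fake_step = lambda enc, toks: script[len(toks)]
out = decode(None, fake_step)  # ['मैं', 'घटिया', 'हूं']
```

Note how a wrong token, once emitted, becomes part of the input for every subsequent step; there is no way to generate token *t* without first generating tokens 1 through *t − 1*.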
Now let us look at the training process.
During training, we take a sentence from the dataset, for example “How are you?”, and pass it through the encoder in the same way. The encoder outputs are then fed to the decoder. At the first time step, we again provide the <start> token. Suppose the decoder predicts तुम instead of the correct word आप. This prediction is wrong, but during training we use teacher forcing. That means, instead of feeding the decoder’s own prediction into the next time step, we feed the ground truth token from the dataset.
So, at the second time step, we provide आप as input (the correct token), and the decoder predicts कैसे, which is correct. At the third time step, we again feed the correct token from the dataset, but the decoder predicts थे instead of हैं. At the fourth time step, we feed the correct token हैं, and the decoder produces the <end> token. The model’s raw output sequence becomes तुम कैसे थे, whereas the target sequence is आप कैसे हैं.
At this point, we compute the loss between the predicted sequence and the ground truth, apply backpropagation, update the weights, and then move on to the next training example. If we strictly follow this setup, the training process is also autoregressive and sequential, just like inference.
However, this creates a major problem. If training were fully autoregressive, it would be extremely slow. Even for a short sentence of three words, the decoder must run multiple sequential steps. For longer sequences, such as a paragraph with 300 words, the decoder would need to run hundreds of sequential operations for a single training example. This makes training computationally expensive and inefficient.
Now comes an important insight. During inference, autoregressive behaviour is unavoidable because each time step depends on the output of the previous one. We cannot generate future tokens without first generating the past tokens. However, during training, this dependency is not mandatory because of teacher forcing. Since the correct tokens are already available in the dataset, the decoder does not need to rely on its own previous predictions.
Because the inputs at each time step during training come directly from the dataset and not from the model’s previous outputs, the sequential dependency is removed. This allows the Transformer to process tokens in parallel during training. As a result, the decoder can be trained in a non-autoregressive manner, where computations are vectorized and parallelized across time steps. This design makes Transformer training significantly faster and more efficient compared to a strictly autoregressive training setup.
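Concretely, teacher forcing means the decoder input at every position is the ground-truth sequence shifted right by one. A minimal sketch:

```python
# With teacher forcing, the decoder input is the ground truth shifted
# right by one position, so the whole sequence is known up front.
target = ["आप", "कैसे", "हैं"]
decoder_input = ["<start>"] + target[:-1]  # ["<start>", "आप", "कैसे"]
decoder_labels = target                    # token to predict at each position
# Because every input position is available before decoding starts, a
# Transformer can compute the predictions for all positions in one
# parallel pass instead of one sequential step at a time.
```

Nothing in `decoder_input` depends on the model's own predictions, which is precisely why the sequential dependency disappears during training.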
THE PROBLEM IN PARALLELIZING
Now the question is: how do we enable parallel execution during training? To do that, the decoder cannot behave in a fully autoregressive way. However, this is not as simple as it sounds. Until now, we have largely ignored the internal computations of the decoder. In reality, the decoder consists of multiple blocks, one of which is masked multi-head attention. For simplicity, we can temporarily think of multi-head attention as self-attention. To further simplify the discussion, focus only on the first block of the decoder, which is the self-attention block.
Consider the target sentence “आप कैसे हैं”. After tokenization, we pass these three tokens through the embedding layer and then apply positional encoding. As a result, we obtain three embedding vectors, one for each word. These embeddings are then fed into the self-attention block.
Self-attention takes the embedding of a particular word and produces a contextual embedding for that word. This contextual embedding captures information from other words in the sentence. Intuitively, we can think of a contextual embedding as a weighted combination of embeddings of all words in the sentence. For example, the contextual embedding for आप might look like
0.8 × embedding(आप) + 0.1 × embedding(कैसे) + 0.1 × embedding(हैं).
Similarly, the contextual embedding for कैसे could be
0.15 × embedding(आप) + 0.75 × embedding(कैसे) + 0.1 × embedding(हैं),
and for हैं it could be
0.1 × embedding(आप) + 0.2 × embedding(कैसे) + 0.7 × embedding(हैं).
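The weighted combinations above are just a matrix product of the attention weights with the stacked embeddings. In this sketch the 2-d embedding vectors are made-up numbers; only the weight matrix comes from the example above.

```python
import numpy as np

# Toy 2-d embeddings (made-up numbers) for the three words.
emb = {
    "आप":   np.array([1.0, 0.0]),
    "कैसे": np.array([0.0, 1.0]),
    "हैं":  np.array([1.0, 1.0]),
}
# The attention weights from the example above; each row sums to 1.
W = np.array([[0.80, 0.10, 0.10],
              [0.15, 0.75, 0.10],
              [0.10, 0.20, 0.70]])

E = np.stack([emb["आप"], emb["कैसे"], emb["हैं"]])  # (3, 2)
contextual = W @ E  # each row is a weighted mix of ALL word embeddings
```

Row 0 of `contextual` is 0.8·emb(आप) + 0.1·emb(कैसे) + 0.1·emb(हैं), and likewise for the other rows: every word's representation draws on every word in the sentence.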
At first glance, this seems perfectly fine. However, a serious problem appears when we compute these embeddings in parallel. When we generate the contextual embedding for आप, we are already using information from कैसे and हैं. But when the sentence is being generated token by token, those future words do not exist yet at that time. The same issue applies to कैसे, whose contextual embedding uses information from हैं, a word that has not been generated yet.
In other words, while computing the representation of the current token, we are allowing it to access future tokens. This is fundamentally incorrect for an autoregressive process. During inference, when we generate the first word, we have no knowledge of what the next words will be. Yet during training, if we compute self-attention naively in parallel, the model is implicitly looking ahead.
This creates a mismatch between training and inference behaviour. In machine learning, we should not allow the model to use information during training that will not be available during inference. Doing so is a form of cheating. It may lead to very good training performance, but poor performance at inference time. This is a classic case of data leakage, and in this context, it is particularly severe because future tokens are leaking into the representation of the current token.
So we face a dilemma. With fully autoregressive training, everything is conceptually correct, but training becomes very slow. When we switch to non-autoregressive, parallel training, computation becomes efficient, but we introduce data leakage because current tokens can see future tokens. To make parallel training work correctly, we must solve this leakage problem.
FINDING THE ANSWER

The solution lies in how self-attention is computed. To understand this, let us walk through the self-attention mechanism step by step and see how masking naturally fits into it.
Consider the target Hindi sentence “आप कैसे हैं”. During training, all tokens of this sentence are passed together into the self-attention module. Before that, each word is converted into an embedding, and positional encoding is added so that word order information is preserved.
Once we have the embeddings, we introduce three learnable weight matrices: W_Q, W_K, and W_V. Each word embedding is multiplied by these matrices, and through these multiplications every word produces three new vectors: a query (Q) vector, a key (K) vector, and a value (V) vector. This happens for every word in the sentence, so for “आप कैसे हैं”, we get three sets of Q, K, and V vectors.
Next, we stack the query vectors of all words to form the Query matrix, do the same for key vectors to form the Key matrix, and value vectors to form the Value matrix.

The core attention computation begins by taking the dot product of the Query matrix with the transpose of the Key matrix. This results in a matrix where each element represents an attention score, indicating how strongly one word should attend to another. These scores are then scaled by dividing each value by √dₖ, where dₖ is the dimensionality of the key vectors. This scaling stabilizes training by preventing extremely large values.

After scaling, we apply the softmax function row-wise. This converts the attention scores into normalized weights that sum to one. These weights determine how much importance each word gives to the others when forming its contextual representation.
Finally, these attention weights are multiplied with the Value matrix. For example, to compute the contextual embedding for आप, we take
w₁₁ ⋅ V(आप) + w₁₂ ⋅ V(कैसे) + w₁₃ ⋅ V(हैं).
The same process is repeated for कैसे and हैं. This gives us a contextual embedding for each word, incorporating information from relevant tokens in the sentence. This is the standard self-attention mechanism.
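The whole pipeline, from projections through scaled dot-product scores, row-wise softmax, and the weighted combination of values, fits in a short sketch. The dimensions and random inputs here are arbitrary placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 3, 4, 4                  # arbitrary toy sizes
X = rng.normal(size=(seq_len, d_model))          # embeddings + positional enc.

# Learnable projection matrices (random stand-ins here).
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

scores = Q @ K.T / np.sqrt(d_k)   # scaled dot-product attention scores
weights = softmax(scores)         # each row sums to 1
contextual = weights @ V          # weighted combination of value vectors
```

Note that `weights` here is a full 3×3 matrix: every position attends to every other position, including future ones, which is exactly the problem the next section addresses.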

Now, here is where masking becomes essential.
While computing the contextual embedding for आप, we should not use information from कैसे and हैं, because at that point in an autoregressive process, those words do not yet exist. Similarly, while computing the contextual embedding for कैसे, the word हैं should not contribute. These future tokens are simply not available at that time step, so their contribution should be ignored.
To enforce this, the attention weights corresponding to future positions must be zero. Conceptually, this means values like w₁₂, w₁₃, and w₂₃ should not contribute to the final embedding. This is achieved through masking.

Practically, after computing the scaled dot-product between the Query and Key matrices, we introduce an additional mask matrix of the same shape. Wherever attention to future tokens should be blocked, we add −∞ to those positions. For allowed positions, we add zero. When we apply softmax after this step, the softmax of −∞ becomes zero, effectively removing any contribution from future tokens.
As a result:
- While computing the contextual embedding of आप, only आप contributes.
- While computing the contextual embedding of कैसे, only आप and कैसे contribute.
- While computing the contextual embedding of हैं, all three words contribute.
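The masking step itself is a one-line change on top of the attention computation: add −∞ above the diagonal before the softmax. In this sketch the score matrix is a uniform placeholder so the effect of the mask is easy to read off.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

seq_len = 3
scores = np.ones((seq_len, seq_len))  # stand-in for scaled QK^T scores

# Causal mask: -inf above the diagonal (future positions), 0 elsewhere.
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
weights = softmax(scores + mask)  # softmax of -inf becomes exactly 0
# Row 0 attends only to position 0; row 1 to positions 0-1; row 2 to all.
```

With uniform scores, the rows come out as [1, 0, 0], [0.5, 0.5, 0], and [1/3, 1/3, 1/3]: each token distributes its attention only over itself and earlier tokens, which is the autoregressive constraint enforced in parallel.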
This masking mechanism allows the Transformer to process all tokens in parallel during training, while still preserving the autoregressive constraint that prevents a token from accessing future information. At the same time, it eliminates data leakage, because the current token never sees future tokens that would not be available during inference.
This is how masked self-attention enables non-autoregressive, parallel training while keeping the model’s behavior consistent with autoregressive inference.


