Cross Attention in Transformers Explained: Self vs Cross Attention Step by Step
- Aryan

- 3 days ago
- 6 min read
WHAT IS CROSS ATTENTION?
Cross attention is a mechanism used in transformer architectures, especially in sequence-to-sequence tasks such as machine translation and summarization. It enables the model to selectively focus on relevant parts of an input sequence while generating each token of the output sequence.
Let’s understand this using a machine translation example, where we translate a sentence from English to Hindi using an encoder–decoder architecture.
Assume the input sentence is:
“I like eating ice-cream.”
The encoder’s role is to read the entire input sentence and convert it into a rich set of representations that capture the meaning and context of each word. This is often loosely referred to as a “summary,” but in practice, the encoder produces contextual embeddings for all input tokens. These encoder outputs are then passed to the decoder.
The decoder’s role is to generate the Hindi sentence step by step, in an autoregressive manner. This means it produces one word at a time, and each new word depends on what has already been generated.

Suppose we are at timestep 3 of the decoder. By now, the decoder has already generated the Hindi words:
“मुझे आइसक्रीम”
At this point, the next output word depends on two sources of information:
1. What the decoder has generated so far
This includes the previously generated Hindi words. The decoder captures this information using self-attention, which models relationships within the same output sequence. Self-attention helps the decoder maintain context and grammatical consistency in the generated sentence.
2. Relevant information from the encoder output
The decoder also needs to know which parts of the original English sentence are most relevant for generating the next Hindi word. This requires understanding the relationship between the decoder’s current state and the encoder’s representations.
The first requirement is handled by self-attention inside the decoder.
The second requirement—linking the decoder output to the encoder input—is handled by cross attention.

We need a clear relationship between each decoder word and the encoder words. In other words, we want to align the Hindi words being generated with the corresponding English words from the input sentence.
When attention is computed within the same sequence, it is called self-attention.
When attention is computed across two different sequences—decoder queries attending over encoder keys and values—it is called cross attention.
During each decoder timestep, cross attention allows the model to measure how strongly the current output token should attend to each input token. This is how the decoder identifies which English words are most relevant while generating the next Hindi word.
In short, cross attention connects the encoder and decoder, enabling accurate and context-aware sequence-to-sequence generation.
HOW CROSS ATTENTION WORKS
Cross attention is conceptually very similar to self-attention. The core idea of attention remains the same, and only small differences exist in terms of input, processing, and output. At a conceptual level, both mechanisms work in the same way: they compute relevance scores and use them to build contextual representations.
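Since both mechanisms share the same core computation, we can sketch it once as a single function. This is a minimal NumPy sketch of scaled dot-product attention with toy dimensions; the only difference between self and cross attention is which sequence supplies Q and which supplies K and V:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention, shared by self and cross attention.

    Q: (n_q, d)   query vectors
    K: (n_kv, d)  key vectors
    V: (n_kv, d_v) value vectors
    Returns (n_q, d_v): one contextual embedding per query.
    """
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # relevance of each query to each key
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V                       # weighted sum of value vectors
```

For self-attention, Q, K, and V all come from the same sequence; for cross attention, Q comes from one sequence and K, V from another, as the following sections spell out.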
SELF ATTENTION VS CROSS ATTENTION (INPUT)

Let us first understand the difference from the input perspective.
In self-attention, we provide a single sequence as input. For example, consider the English sentence:
“we are friends”
The goal is to generate contextual embeddings for this sentence. To do this, we pass the token embeddings (or contextual embeddings from the previous layer) of this same sequence into the self-attention layer. Self-attention always operates on one sequence and learns relationships within that sequence itself.
In cross attention, the objective is to learn relationships between two different sequences. Therefore, we pass both sequences as input. For example, we may pass the English sentence “we are friends” along with its Hindi counterpart “हम दोस्त हैं”. The embeddings (or contextual embeddings) of both sequences are provided to the cross-attention layer.
This is the first and most important difference at the input level.
In self-attention, only one sequence is given as input.
In cross attention, two different sequences are provided so that relationships between them can be learned.
SELF ATTENTION VS CROSS ATTENTION (PROCESSING)
In self-attention, we start with a single sequence, for example: “we are friends.” For each token, we first obtain an embedding (or a contextual embedding from the previous layer). Inside the self-attention block there are three learnable projection matrices: W_q, W_k, and W_v. Each token embedding is multiplied by these matrices to produce three vectors for every word: a query, a key, and a value vector.

Once these vectors are computed, the query vector of each word is compared with the key vectors of all words using a dot product. This gives attention scores that represent how relevant each word is to the others within the same sequence. These scores are scaled by the square root of the key dimension and then normalized with softmax to obtain attention weights.

Finally, a weighted sum of the value vectors is computed using these weights, resulting in contextual embeddings that capture relationships between words in the same sentence.
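The three steps above can be sketched end to end. This is an illustrative NumPy example with a toy embedding size of 4; the random matrices stand in for learned embeddings and learned projection weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # toy embedding size

# token embeddings for "we are friends", one row per token
# (random values stand in for real learned embeddings)
X = rng.standard_normal((3, d))

# the three learnable projection matrices of the self-attention block
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

# one query, key, and value vector per token -- all from the SAME sequence
Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T / np.sqrt(d)  # dot-product similarity, scaled

# softmax over each row turns scores into attention weights
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

contextual = weights @ V  # one contextual embedding per input word
```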

In cross attention, the overall structure remains the same, but the key difference is that we now deal with two different sequences instead of one. We still begin by generating embeddings or contextual embeddings for every word, but this time for both the input and the output sequences.
Cross attention follows the same mathematical steps as self-attention, but the roles of query, key, and value vectors change. The query vectors come from the output sequence, while the key and value vectors come from the input sequence. For example, in English-to-Hindi translation, the input sequence is English and the output sequence is Hindi. When forming the query vectors, Hindi token embeddings are multiplied with W_q. At the same time, English token embeddings are multiplied with W_k and W_v to generate key and value vectors.

As a result, for each Hindi word we obtain one query vector, and for each English word we obtain a key and a value vector. Each query vector is then dotted with all key vectors, producing a set of scalar attention scores. These scores are passed through softmax to obtain attention weights, which represent how strongly a particular output word is related to each input word.

This process is repeated for every query vector in the output sequence. Using the resulting attention weights, a weighted sum of the value vectors is computed to produce the final contextual representation for each output token.
In summary, the main difference in processing is this: in self-attention, query, key, and value vectors all come from the same sequence, whereas in cross attention, query vectors come from the output sequence and key–value vectors come from the input sequence. This is what allows cross attention to model relationships between two different sequences effectively.
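A minimal sketch of this asymmetry, again with toy dimensions and random stand-in embeddings; note that the input and output sequences are deliberately given different lengths so the shapes make the roles clear:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4  # toy embedding size

# stand-in embeddings: 4 English (input) tokens, 3 Hindi (output) tokens
X_en = rng.standard_normal((4, d))  # e.g. the English source sentence
X_hi = rng.standard_normal((3, d))  # e.g. the Hindi tokens generated so far

W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

Q = X_hi @ W_q  # queries come from the OUTPUT (Hindi) sequence
K = X_en @ W_k  # keys come from the INPUT (English) sequence
V = X_en @ W_v  # values also come from the input sequence

scores = Q @ K.T / np.sqrt(d)  # (3, 4): one row of scores per Hindi token
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

contextual = weights @ V  # (3, d): one embedding per OUTPUT token
```

The output has three rows because there are three queries (Hindi tokens), even though the input sentence has four tokens.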

SELF ATTENTION VS CROSS ATTENTION (OUTPUT)
Let us now compare the outputs produced by self-attention and cross attention.
In self-attention, we pass embeddings for every token in a single sequence. After the attention computation, we obtain contextual embeddings for each word in the same sentence. Each contextual embedding can be written as a weighted combination of embeddings from all words in the sequence.
For example, for the sentence “we are friends”, the outputs can be represented as:
ce_we = 0.8 × e_we + 0.1 × e_are + 0.1 × e_friends
ce_are = 0.15 × e_we + 0.75 × e_are + 0.1 × e_friends
ce_friends = 0.2 × e_we + 0.1 × e_are + 0.7 × e_friends
Each contextual embedding is a weighted sum of all word embeddings in the same sentence. This representation clearly shows how self-attention captures relationships within a single sequence. In short, self-attention produces one contextual embedding for every input word.
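We can check these weighted sums numerically. The 2-d embedding values below are made up purely for illustration; only the attention weights are taken from the example above:

```python
import numpy as np

# illustrative 2-d embeddings for "we", "are", "friends"
E = np.array([[1.0, 0.0],   # e_we
              [0.0, 1.0],   # e_are
              [1.0, 1.0]])  # e_friends

# self-attention weights from the example (each row sums to 1)
W = np.array([[0.80, 0.10, 0.10],   # weights for ce_we
              [0.15, 0.75, 0.10],   # weights for ce_are
              [0.20, 0.10, 0.70]])  # weights for ce_friends

ce = W @ E  # each row is a weighted sum of all word embeddings

# ce[0] = 0.8*e_we + 0.1*e_are + 0.1*e_friends = [0.9, 0.2]
```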
In cross attention, we work with two different sequences. The number of output contextual embeddings is equal to the number of tokens in the output sequence, not the input sequence.
For example, when translating “we are friends” into Hindi “हम दोस्त हैं”, the contextual embeddings for the output tokens can be written as:
ce_हम = 0.5 × e_we + 0.3 × e_are + 0.2 × e_friends
ce_दोस्त = 0.2 × e_we + 0.2 × e_are + 0.6 × e_friends
ce_हैं = 0.3 × e_we + 0.4 × e_are + 0.3 × e_friends
Here, each output contextual embedding is computed as a weighted contribution of the input sequence embeddings. For instance, the contextual embedding of “हम” is formed by combining information from “we”, “are”, and “friends” with different attention weights.
The key difference is that self-attention computes representations within the same sequence, whereas cross attention computes representations of the output sequence in terms of the input sequence.
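The same numeric check works for the cross-attention example. Again the 2-d English embeddings are made-up illustrative values; the weights are the ones from the Hindi example above, and the result has one row per output (Hindi) token:

```python
import numpy as np

# illustrative 2-d embeddings for the INPUT tokens "we", "are", "friends"
E_en = np.array([[1.0, 0.0],   # e_we
                 [0.0, 1.0],   # e_are
                 [1.0, 1.0]])  # e_friends

# cross-attention weights from the example: rows are the OUTPUT tokens
# (हम, दोस्त, हैं), columns are the input tokens
W = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.2, 0.6],
              [0.3, 0.4, 0.3]])

ce = W @ E_en  # each output embedding mixes INPUT-side embeddings

# ce[0] (for हम) = 0.5*e_we + 0.3*e_are + 0.2*e_friends = [0.7, 0.5]
```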
CROSS ATTENTION VS BAHDANAU / LUONG ATTENTION
In Bahdanau attention, the decoder computes similarity with input vectors using a small neural network. In Luong attention, this similarity is computed using a dot product. In both cases, a context vector is generated at every decoder timestep based on the input sequence.
Transformer-style cross attention follows the same underlying idea. In encoder–decoder attention layers, queries come from the decoder, while keys and values come from the encoder outputs. This allows every decoder position to attend over all positions in the input sequence. The design of transformer attention was clearly inspired by these earlier attention mechanisms but made more efficient and parallelizable.
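The two classic score functions can be sketched side by side. This is an illustrative comparison with toy dimensions; the random vectors stand in for a decoder state and a set of encoder states, and the weight matrices for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
s = rng.standard_normal(d)       # current decoder state (the "query")
H = rng.standard_normal((5, d))  # encoder states for 5 input positions

# Luong-style (multiplicative) score: a plain dot product --
# transformer attention uses the same form, scaled by sqrt(d)
luong_scores = H @ s

# Bahdanau-style (additive) score: a small one-hidden-layer network
W1 = rng.standard_normal((d, d))
W2 = rng.standard_normal((d, d))
v = rng.standard_normal(d)
bahdanau_scores = np.tanh(H @ W1 + s @ W2) @ v

# either score vector is then softmaxed into attention weights
# over the 5 input positions to form a context vector
```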
USE CASES
Cross attention is widely used in tasks where the input and output sequences are different. Common examples include machine translation and question answering. It is also heavily used in multimodal models, such as image captioning, where images are the input and text is the output. Similarly, in text-to-image generation, text-to-speech systems, and other cross-modal tasks, cross attention is used to model relationships between different types of sequences.
In essence, whenever the model needs to relate one sequence or modality to another, cross attention becomes the central mechanism.


