Cross Attention in Transformers Explained: Self vs Cross Attention Step by Step
Cross attention is a key mechanism in transformer encoder–decoder models that allows the decoder to focus on relevant parts of the input sequence. This guide explains cross attention step by step, compares it with self-attention, and shows how output representations are formed using input context.

Aryan
3 days ago
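
The mechanism the post describes can be sketched in a few lines: in cross attention, queries come from the decoder while keys and values come from the encoder output, so each decoder position forms its representation from input context. This is a minimal single-head sketch with identity projections (real models learn separate `W_q`, `W_k`, `W_v` matrices, which are omitted here for clarity):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_states):
    """Single-head cross attention: queries from the decoder,
    keys and values from the encoder output."""
    d_k = decoder_states.shape[-1]
    Q = decoder_states              # (T_dec, d) -- what the decoder asks for
    K = encoder_states              # (T_enc, d) -- what the input offers
    V = encoder_states              # (T_enc, d) -- what is actually mixed in
    scores = Q @ K.T / np.sqrt(d_k)       # (T_dec, T_enc)
    weights = softmax(scores, axis=-1)    # each decoder step attends over all inputs
    return weights @ V                    # (T_dec, d): input-contextualized outputs

rng = np.random.default_rng(0)
dec = rng.normal(size=(3, 8))   # 3 decoder positions
enc = rng.normal(size=(5, 8))   # 5 encoder positions
out = cross_attention(dec, enc)
print(out.shape)  # (3, 8): one output vector per decoder position
```

Note the attention matrix is rectangular, `(T_dec, T_enc)`: decoder and input lengths need not match, which is what distinguishes this from self-attention's square `(T, T)` pattern.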


Masked Self Attention Explained: Why Transformers Are Autoregressive Only at Inference
Transformer decoders behave autoregressively during inference but are trained with parallel computation over the whole sequence. This post explains why naive parallel self-attention leaks information from future tokens and how masked self-attention prevents that leakage while preserving autoregressive behavior.

Aryan
5 days ago
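
The fix described above can be sketched concretely: a causal mask sets attention scores above the diagonal to negative infinity before the softmax, so position `i` can only attend to positions `0..i` even when all positions are computed in parallel. A minimal NumPy sketch (identity projections for brevity):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax; exp(-inf) becomes exactly 0
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(x):
    """Causal self-attention: position i may only attend to positions <= i,
    so parallel training cannot leak future tokens."""
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)                       # (T, T)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)    # True above the diagonal
    scores = np.where(mask, -np.inf, scores)            # block attention to the future
    weights = softmax(scores, axis=-1)
    return weights @ x, weights

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 8))     # 4 token positions
out, w = masked_self_attention(x)
print(np.allclose(np.triu(w, k=1), 0))  # True: zero weight on future positions
```

Because the masked positions receive exactly zero weight, the parallel training computation produces the same per-position outputs as running the decoder one token at a time, which is why masking preserves autoregressive behavior.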