

Masked Self-Attention Explained: Why Transformers Are Autoregressive Only at Inference
Transformer decoders behave autoregressively at inference time, generating one token at a time, yet they are trained on whole sequences in parallel. This post explains why naive parallel self-attention would leak information from future tokens during training, and how masked self-attention prevents that leakage while preserving autoregressive behavior.
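As a minimal sketch of the idea, assuming a single attention head in PyTorch, with a hypothetical function name (`causal_self_attention`) and randomly initialized projection weights standing in for learned ones: the mask sets every future-position score to negative infinity before the softmax, so each row of the attention matrix places weight only on the current and earlier tokens.

```python
import torch
import torch.nn.functional as F

def causal_self_attention(x: torch.Tensor) -> torch.Tensor:
    """Single-head self-attention over x of shape (seq_len, d_model) with a causal mask."""
    seq_len, d_model = x.shape

    # Illustrative random projections; a real layer would learn these weights.
    w_q = torch.randn(d_model, d_model)
    w_k = torch.randn(d_model, d_model)
    w_v = torch.randn(d_model, d_model)
    q, k, v = x @ w_q, x @ w_k, x @ w_v

    # Raw attention logits: without a mask, every position scores every other
    # position, including tokens that come later in the sequence.
    scores = (q @ k.T) / d_model ** 0.5                      # (seq_len, seq_len)

    # Causal mask: entries above the diagonal correspond to future positions.
    # Setting them to -inf gives them zero weight after the softmax.
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))

    weights = F.softmax(scores, dim=-1)   # row i attends only to positions <= i
    return weights @ v                    # (seq_len, d_model)

x = torch.randn(5, 8)                     # toy sequence: 5 tokens, dimension 8
out = causal_self_attention(x)
print(out.shape)                          # torch.Size([5, 8])
```

Because the mask zeroes out attention to future positions, all positions can be computed in one parallel pass during training while each output still depends only on earlier tokens, which is exactly the behavior the autoregressive inference loop reproduces step by step.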

Aryan