Multi-Head Attention in Transformers Explained: Concepts, Math & Mechanics
Multi-head attention addresses a key limitation of self-attention by enabling Transformers to capture multiple semantic perspectives simultaneously. This article explains the intuition, working mechanism, dimensional flow, and original Transformer implementation of multi-head attention using clear examples and mathematical reasoning.
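The mechanism the article describes can be sketched in a few lines of NumPy: project the input into queries, keys, and values, split the model dimension into `h` heads, run scaled dot-product attention independently per head, then concatenate and project back. This is a minimal illustration, not the article's code; the weight shapes and the `d_model = 64`, `h = 8` sizes are assumptions for the demo.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    """X: (seq, d_model); all weight matrices: (d_model, d_model). Shapes are illustrative."""
    seq, d_model = X.shape
    d_k = d_model // h
    # Project, then split d_model into h heads of size d_k each.
    Q = (X @ Wq).reshape(seq, h, d_k).transpose(1, 0, 2)  # (h, seq, d_k)
    K = (X @ Wk).reshape(seq, h, d_k).transpose(1, 0, 2)
    V = (X @ Wv).reshape(seq, h, d_k).transpose(1, 0, 2)
    # Scaled dot-product attention runs independently in each head,
    # so each head can attend to a different semantic perspective.
    weights = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_k))
    heads = weights @ V                                   # (h, seq, d_k)
    # Concatenate heads back to d_model and apply the output projection.
    return heads.transpose(1, 0, 2).reshape(seq, d_model) @ Wo

rng = np.random.default_rng(0)
d_model, h, seq = 64, 8, 5
W = [rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(4)]
out = multi_head_attention(rng.standard_normal((seq, d_model)), *W, h)
print(out.shape)  # shape is preserved: (seq, d_model)
```

Note how the dimensional flow works: splitting into heads keeps the total parameter count and output shape identical to single-head attention over the full `d_model`.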

Aryan
Mar 2


Visualizing Self-Attention: A Geometric Intuition & The Math Behind the Magic
This post explains self-attention using geometric intuition. By visualizing embeddings, dot products, scaling, and weighted vector sums, we see how contextual embeddings shift based on surrounding words and capture meaning relative to context.
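The "weighted vector sum" intuition can be made concrete with toy 2-D embeddings: a word's contextual embedding is a softmax-weighted average of its neighbors' vectors, so the same word shifts toward whichever context surrounds it. The vectors below are invented for illustration only.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical 2-D embeddings, chosen only to make the geometry visible.
emb = {"bank":  np.array([1.0, 1.0]),
       "river": np.array([0.0, 2.0]),
       "money": np.array([2.0, 0.0])}

def contextual(word, context, d_k=2):
    """Contextual embedding of `word`: attention-weighted sum of context vectors."""
    q = emb[word]
    K = np.stack([emb[w] for w in context])
    w = softmax(q @ K.T / np.sqrt(d_k))  # dot-product similarity -> weights
    return w @ K                          # weighted vector sum

print(contextual("bank", ["bank", "river"]))  # shifted toward "river"
print(contextual("bank", ["bank", "money"]))  # shifted toward "money"
```

The static vector for "bank" is the same in both calls; only the weighted sum over its context changes, which is exactly the geometric shift the post visualizes.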

Aryan
Feb 26


Scaled Dot-Product Attention Explained: Why We Divide by √dₖ in Transformers
Scaled dot-product attention is a core component of Transformer models, but why do we divide by √dₖ before applying softmax? This article explains the variance growth problem in high-dimensional dot products, the role of scaling in stabilizing softmax, and the mathematical intuition that makes attention training reliable and effective.
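The variance-growth problem the summary mentions is easy to verify empirically: for query and key components drawn i.i.d. from a unit-variance distribution, the dot product has variance dₖ, and dividing by √dₖ brings it back to roughly 1. A quick sketch (dₖ = 512 is an arbitrary choice for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, n = 512, 10_000

# n independent query/key pairs with unit-variance components.
q = rng.standard_normal((n, d_k))
k = rng.standard_normal((n, d_k))

scores = np.einsum("ij,ij->i", q, k)   # row-wise dot products
print(scores.var())                    # grows like d_k
print((scores / np.sqrt(d_k)).var())   # rescaled back to ~1
```

Without the rescaling, softmax over scores with variance 512 saturates to near one-hot weights, which is the gradient problem the article unpacks.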

Aryan
Feb 21