Scaled dot-product attention is a core component of Transformer models, but why do we divide by √dₖ before applying softmax? This article explains the variance growth problem in high-dimensional dot products, the role of scaling in stabilizing softmax, and the mathematical intuition that makes attention training reliable and effective.