Visualizing Self-Attention: A Geometric Intuition & The Math Behind the Magic

  • Writer: Aryan
  • Feb 26
  • 2 min read

GEOMETRIC INTUITION


Let's understand what happens mathematically and geometrically inside self-attention. We will see how contextual embeddings are formed by interpreting each step of the process in geometric terms.

Consider a simple sentence: “money bank.”

We start by converting each word into a vector embedding using a technique such as Word2Vec. For simplicity, assume a 2-dimensional embedding space. Let the embeddings be:

• eₘₒₙₑᵧ = (2, 7)

• eᵦₐₙₖ = (9, 3)

We can plot these vectors on a 2D graph to visualize their positions.

Next, we apply linear transformations to obtain the query, key, and value vectors. For this, we use three weight matrices: W_Q, W_K, and W_V. For simplicity, assume all of them are 2×2 matrices with hypothetical values.

Now, we multiply each word embedding by these matrices (a vector–matrix product):

• eₘₒₙₑᵧ · W_Q , eₘₒₙₑᵧ · W_K , eₘₒₙₑᵧ · W_V

• eᵦₐₙₖ · W_Q , eᵦₐₙₖ · W_K , eᵦₐₙₖ · W_V

This linear transformation converts a single embedding vector into three different vectors. As a result, we obtain:

• qₘₒₙₑᵧ , kₘₒₙₑᵧ , vₘₒₙₑᵧ

• qᵦₐₙₖ , kᵦₐₙₖ , vᵦₐₙₖ

At this point, we have six vectors in total.
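The projection step above can be sketched in NumPy. The weight-matrix values below are hypothetical, chosen only for illustration; in a real model they are learned during training:

```python
import numpy as np

# Word embeddings from the example (2-dimensional)
e_money = np.array([2.0, 7.0])
e_bank = np.array([9.0, 3.0])

# Hypothetical 2x2 weight matrices (learned in a real model)
W_Q = np.array([[1.0, 0.5], [0.0, 1.0]])
W_K = np.array([[0.8, 0.2], [0.3, 1.0]])
W_V = np.array([[1.0, 0.0], [0.5, 0.5]])

# Each embedding is projected into three different vectors
q_money, k_money, v_money = e_money @ W_Q, e_money @ W_K, e_money @ W_V
q_bank, k_bank, v_bank = e_bank @ W_Q, e_bank @ W_K, e_bank @ W_V

print(q_bank, k_bank, v_bank)  # three of the six vectors, one triple per word
```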

The next step is to compute contextual embeddings, denoted as yₘₒₙₑᵧ and yᵦₐₙₖ. Let us focus on generating the contextual embedding for the word bank.

To do this, we measure how much the word bank attends to money and to itself. We compute similarity scores using dot products:

• s₂₁ = qᵦₐₙₖ · kₘₒₙₑᵧ

• s₂₂ = qᵦₐₙₖ · kᵦₐₙₖ

Assume these dot products give:

• s₂₁ = 10

• s₂₂ = 32

These values are hypothetical and used only for understanding. Since the key dimension dₖ = 2, we apply scaling by √dₖ:

• s₂₁′ = 10 ⁄ √2 ≈ 7.07

• s₂₂′ = 32 ⁄ √2 ≈ 22.63

Next, we apply the softmax function to these scaled scores to obtain attention weights (again, illustrative values):

• w₂₁ = 0.2

• w₂₂ = 0.8
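The scaling and softmax steps can be sketched as below. Note that an actual softmax over 7.07 and 22.63 is almost one-hot; the 0.2 / 0.8 weights in the text are purely illustrative:

```python
import numpy as np

d_k = 2
scores = np.array([10.0, 32.0])  # s21, s22 from the example
scaled = scores / np.sqrt(d_k)   # approx. [7.07, 22.63]

# Softmax converts scores into positive weights that sum to 1
# (subtracting the max is a standard numerical-stability trick)
weights = np.exp(scaled - scaled.max()) / np.exp(scaled - scaled.max()).sum()

print(weights)  # heavily peaked toward s22
```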

These weights tell us how much attention the word bank gives to money and to itself. At this stage, we do not need all vectors anymore. We only require the value vectors vₘₒₙₑᵧ and vᵦₐₙₖ.

We now compute a weighted sum of the value vectors:

• w₂₁ · vₘₒₙₑᵧ

• w₂₂ · vᵦₐₙₖ

Geometrically, this means scaling down vₘₒₙₑᵧ and scaling up vᵦₐₙₖ.

We then add these scaled vectors using vector addition (parallelogram rule). The resultant vector is the contextual embedding yᵦₐₙₖ.
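The weighted sum can be sketched with the illustrative weights 0.2 and 0.8 and hypothetical value vectors:

```python
import numpy as np

# Attention weights for 'bank' (illustrative values from the text)
w21, w22 = 0.2, 0.8

# Hypothetical value vectors
v_money = np.array([5.5, 3.5])
v_bank = np.array([10.5, 1.5])

# Contextual embedding: weighted vector sum (parallelogram rule)
y_bank = w21 * v_money + w22 * v_bank
print(y_bank)  # mostly v_bank, pulled slightly toward v_money
```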

If we now compare yᵦₐₙₖ with the original embedding eᵦₐₙₖ on the graph, we observe that yᵦₐₙₖ shifts closer to eₘₒₙₑᵧ. This happens because, in this context, the word bank is strongly related to money.

Self-attention generates embeddings based on context. If we instead used the sentence “river bank,” the word bank would move closer to river after self-attention. Even if the original embeddings of river and bank are far apart, self-attention can bring them closer based on contextual usage.

This is why self-attention is powerful. It allows a word to be represented differently depending on the neighboring words in the sentence. The final contextual embedding captures meaning relative to context, and this behavior emerges from patterns learned from the dataset during training.



