
Positional Encoding in Transformers Explained from First Principles

  • Writer: Aryan
  • 6 days ago
  • 12 min read

WHY POSITIONAL ENCODING IS REQUIRED?


In self-attention, a sentence is first divided into tokens, and embeddings are generated for each token. For example, consider the words river and bank. Once their embeddings are created, the self-attention mechanism uses these embeddings to produce contextual embeddings—embeddings that are dynamic and change depending on how a word is used.

The word bank in the sentences river bank and money bank will have different contextual embeddings because the surrounding context is different. This is one of the key strengths of self-attention: it can model context-dependent meaning effectively.

Another important advantage of self-attention is that all contextual embeddings can be computed in parallel. This means that regardless of sentence length, embeddings for all tokens are generated simultaneously. In contrast, traditional RNNs process tokens one by one across time steps, making the computation inherently sequential and slow.

So, the two major benefits of self-attention are:

  1. It generates rich contextual embeddings.

  2. All computations are performed in parallel.

However, self-attention also has an important limitation. By itself, it cannot capture the order of words in a sentence. The self-attention mechanism has no built-in way to understand which word comes before or after another.

This limitation does not exist in RNNs, because RNNs process words sequentially and therefore inherently encode word order. In contrast, self-attention treats all tokens simultaneously and, without additional information, can treat sentences with the same words but different orders as similar. For example, sentences like Rahul killed the lion and The lion killed Rahul would not be distinguished correctly based on word order alone.
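This order-blindness is easy to verify. The sketch below uses toy NumPy vectors and a deliberately simplified single-head attention (queries, keys, and values all equal the input; no learned weights) to show that permuting the input tokens only permutes the outputs, without changing any of them:

```python
import numpy as np

def self_attention(X):
    # Simplified single-head attention with Q = K = V = X (no learned weights).
    scores = X @ X.T / np.sqrt(X.shape[1])
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return weights @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))  # 4 tokens, embedding dimension 8
perm = [3, 1, 0, 2]          # the same tokens in a different order

out = self_attention(X)
out_perm = self_attention(X[perm])

# The reordered sentence produces the same vectors, just in reordered positions.
print(np.allclose(out[perm], out_perm))  # True
```

Without positional information, "Rahul killed the lion" and "The lion killed Rahul" yield exactly the same set of contextual embeddings.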

Therefore, we must explicitly provide information about the ordering of words to the self-attention mechanism. This is where positional encoding comes into the picture.

Positional encoding is a technique used in the self-attention architecture to indicate which word appears at which position in a sentence, enabling the model to understand and preserve word order.

 

PROPOSING A SIMPLE SOLUTION


Let us understand this using a first-principles approach. Consider the sentence Rahul killed the lion. This sentence contains four words, but self-attention by itself does not understand the sequence or order of words. Therefore, we need to explicitly pass this ordering information to the self-attention block.

Let us propose a very simple solution. The simplest idea is counting positions. We assign a sequence number to each word: Rahul gets position 1, killed gets 2, the gets 3, and lion gets 4. Now, assume the word embedding dimension is 512. We can add one extra dimension to store this positional value, making the embedding size 513. In this way, we send the word order information to the self-attention block.

However, this simple solution has multiple serious problems, which makes it unusable in practice.

The first problem is that this approach is unbounded. For a small sentence, we get small values like 1, 2, 3, and 4. But if we process a large document or an entire book, word positions can easily reach tens of thousands. Sending such large numerical values into a neural network is problematic. Neural networks rely on backpropagation, and large input values can lead to unstable gradients. In practice, neural networks perform better when inputs lie in a small range, typically between −1 and 1. This makes raw positional indices unsuitable.

We might try to fix this by normalizing the positions so that all values lie between 0 and 1. For example, we can divide each position by the total number of words in the sentence, ensuring that the last position is always 1. However, this approach also fails.

Consider two sentences: thank you and Rahul killed the lion. For the first sentence, the normalized positions are 1/2 and 2/2. For the second sentence, they are 1/4, 2/4, 3/4, and 4/4. Now observe that the second position in the first sentence has value 1, while the second position in the second sentence has value 0.5. Even though both words appear at the same position index, they receive different values. This means we are not providing consistent positional information, which makes this normalization approach unreliable.
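This inconsistency can be demonstrated in a few lines (the helper function is just for illustration):

```python
# Failed idea: divide each position by the sentence length so values lie in (0, 1].
def normalized_positions(sentence):
    words = sentence.split()
    return [(i + 1) / len(words) for i in range(len(words))]

print(normalized_positions("thank you"))              # [0.5, 1.0]
print(normalized_positions("Rahul killed the lion"))  # [0.25, 0.5, 0.75, 1.0]
# The second position is encoded as 1.0 in one sentence and 0.5 in the other.
```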

The second problem is that the positional values produced by this method are discrete numbers. Neural networks generally prefer smooth, continuous inputs rather than abrupt, discrete transitions. Using discrete values can lead to numerical instability and poor gradient flow during training.

The third problem is that this approach captures only absolute positions, not relative positions. In Rahul killed the lion, assigning positions 1, 2, 3, and 4 tells us where each word appears, but it does not encode how far one word is from another. Relationships such as “this word comes two positions after that word” are not represented. The distance between words is not captured, and this relational information is lost.

Because of these limitations, we need a positional representation that is bounded, continuous, and capable of capturing relative positioning. Ideally, it should also be periodic. One such option is to use trigonometric functions, such as the sine function. The sine function is bounded between −1 and 1, continuous, and periodic—meaning its values repeat in a structured way across positions.

 

THE SINE FUNCTION AS A SOLUTION


Let us see how we can use the sine curve for positional encoding. On the x-axis, we can imagine the word position, and on the y-axis, we have the encoded values. Consider the same sentence: Rahul killed the lion. We already know the word positions—Rahul is at position 1, killed at position 2, and so on.

For each word position, we calculate its encoded value using the sine function. For the first position,

y = sin(1) ≈ 0.84,

for the second, y = sin(2) ≈ 0.91,

for the third, y = sin(3) ≈ 0.14,

and for the fourth, y = sin(4) ≈ −0.76.

We take these four values and insert them into the word embeddings of the corresponding words. In this way, each of the four words receives a positional encoding value. This approach fixes several problems of the earlier method. The values are bounded between −1 and 1, and because the sine function is continuous, it provides smoother transitions. Using the sine curve also gives some notion of relative positioning.
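These single-sine values can be computed directly:

```python
import math

# One sine value per word position, as described above.
positions = [1, 2, 3, 4]  # Rahul, killed, the, lion
encodings = [round(math.sin(p), 2) for p in positions]
print(encodings)  # [0.84, 0.91, 0.14, -0.76]
```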

However, this approach still has a major limitation. One key requirement of positional encoding is that each word position should have a unique representation. Ideally, no two different positions should share the same positional encoding value. The sine function, however, is periodic, meaning its values repeat after a certain interval. As a result, the same value can appear for different positions. If two words end up with the same positional encoding, the model can get confused. So even though this approach solves earlier problems, periodicity becomes a serious issue.

To address this, we can extend the idea by using two trigonometric functions instead of one.

So far, we have used only one function: y = sin(position). Instead, we can use both

y = sin(position) and y = cos(position).

For a single word, this gives us two values—one from sine and one from cosine. This means we now represent each word as a vector instead of a scalar. For example, for the word Rahul at position 1, we compute

sin(1) ≈ 0.84 and cos(1) ≈ 0.54.

So the positional encoding for Rahul becomes the vector [0.84, 0.54].

We insert these two values into the word embedding, and we apply the same process to the remaining words. Now, positional encoding is a vector rather than a single number. This significantly reduces the probability that two different positions will have the same encoding. While collisions are still theoretically possible, this representation is much more robust.

We can further extend this idea by adding more sine–cosine pairs. For example, we can also include sin(position / 2) and cos(position / 2).

Now, if we compute positional encoding for Rahul using four functions—sin(1), cos(1), sin(1/2), and cos(1/2)—the word is represented as a four-dimensional vector. The same process is applied to every other word. With this setup, the probability that two words share the exact same positional encoding vector becomes much lower.

If needed, we can continue extending this by adding additional pairs such as sin(position / 3) and cos(position / 3), resulting in a six-dimensional positional encoding. Although there is still a theoretical chance of overlap, in practice this approach makes positional encodings sufficiently unique and expressive for modeling word order effectively.
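A minimal sketch of this growing scheme (the helper `encode` is hypothetical, following the position, position/2, position/3 pattern described above):

```python
import math

# Each additional sin-cos pair divides the position by a larger number,
# producing a progressively slower-varying pair of values.
def encode(position, num_pairs):
    vec = []
    for k in range(1, num_pairs + 1):
        vec.append(math.sin(position / k))
        vec.append(math.cos(position / k))
    return vec

# Two pairs give Rahul (position 1) a 4-dimensional positional vector.
print([round(v, 2) for v in encode(1, 2)])  # [0.84, 0.54, 0.48, 0.88]
```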

 

EXPLAINING POSITIONAL ENCODING


Let us see how positional encoding is explained in Attention Is All You Need. Suppose we have the sentence river bank. We already have word embeddings for river and bank that we want to send to the self-attention block. Along with these word embeddings, we also need to provide positional information.

To do this, we calculate positional encodings for both words. Positional encoding is not a single value; it is a vector. In fact, each word is assigned its own positional encoding vector.

Now consider the positional encoding vector for the word river. The dimensionality of the positional encoding vector is the same as the dimensionality of the word embedding vector. For example, if the word embedding dimension is 6, then the positional encoding vector for river will also be 6-dimensional. Similarly, the positional encoding vector for the word bank will also be a 6-dimensional vector.

Earlier, one possible approach was to concatenate the word embedding with the positional encoding. However, in the Transformer architecture, a different approach is used. Instead of concatenation, we take the word embedding vector and the positional encoding vector—both having the same dimensionality—and perform vector addition.

By adding these two vectors element-wise, we obtain a single 6-dimensional vector that combines both word meaning and positional information. This combined vector is then passed to the self-attention block.

If we were to use concatenation, combining a 6-dimensional word embedding with a 6-dimensional positional encoding would result in a 12-dimensional vector. This would increase the number of parameters in the model and also increase training time. To avoid this unnecessary increase in dimensionality and computational cost, the Transformer uses vector addition instead of concatenation.
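The difference between the two options can be sketched as follows (the embedding values are made up for illustration):

```python
import numpy as np

# Toy 6-dimensional word embedding and positional encoding.
embedding = np.array([0.2, -0.1, 0.5, 0.3, -0.4, 0.1])
pos_enc   = np.array([0.84, 0.54, 0.05, 1.0, 0.0, 1.0])

added = embedding + pos_enc                    # shape (6,): what the Transformer uses
concat = np.concatenate([embedding, pos_enc])  # shape (12,): the rejected option

print(added.shape, concat.shape)  # (6,) (12,)
```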

This is how positional encoding is integrated with word embeddings before being fed into the self-attention mechanism.

 

HOW THESE VALUES ARE CALCULATED?

 

To create a 6-dimensional positional encoding vector for the word river, we use sine and cosine functions. For a positional encoding of dimension d, we use d/2 sine–cosine pairs. Since the embedding dimension here is 6, we use three sine–cosine pairs.

We construct the positional encoding vector two dimensions at a time.

First, we compute one sine and one cosine value and place them in the first two dimensions.

Next, we use a different sine–cosine pair (with a different frequency) to fill the third and fourth dimensions.

Finally, we use another sine–cosine pair to fill the fifth and sixth dimensions.

This process gives us a 6-dimensional positional encoding vector for the word river. We apply the same procedure to every other word, such as bank.

If the embedding dimension were 8, the positional encoding would also be 8-dimensional, and we would use four sine–cosine pairs. In the original Attention Is All You Need paper, the embedding dimension (d_model) is 512, so the positional encoding is also 512-dimensional.

As we add more sine–cosine pairs, their frequencies decrease. This follows a fixed pattern. The natural question is: how are these frequencies decided?

The original paper defines positional encoding using the following formulas:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))

PE(pos, 2i + 1) = cos(pos / 10000^(2i / d_model))

The denominator 10000^(2i / d_model) controls the frequency of each sine and cosine function.

Now let us apply this formula to our example.

Here:

pos represents the position of the word in the sentence.

d_model is the embedding dimension.

i ranges from 0 to (d_model / 2) − 1.

Assume the sentence is river bank.

The word river is at position 0, and bank is at position 1.

We assume the embedding dimension (d_model) is 6.

So, i ranges from 0 to 2.

 

Positional Encoding for river (pos = 0)

For i = 0:

PE(0, 0) = sin(0 / 10000⁰) = 0

PE(0, 1) = cos(0 / 10000⁰) = 1

For i = 1:

PE(0, 2) = sin(0 / 10000¹ᐟ³) = 0

PE(0, 3) = cos(0 / 10000¹ᐟ³) = 1

For i = 2:

PE(0, 4) = sin(0 / 10000²ᐟ³) = 0

PE(0, 5) = cos(0 / 10000²ᐟ³) = 1

So, the positional encoding vector for river is:

[0, 1, 0, 1, 0, 1]

 

Positional Encoding for bank (pos = 1)

For i = 0:

PE(1, 0) = sin(1 / 10000⁰) ≈ 0.84

PE(1, 1) = cos(1 / 10000⁰) ≈ 0.54

For i = 1:

PE(1, 2) = sin(1 / 10000¹ᐟ³) ≈ 0.05

PE(1, 3) = cos(1 / 10000¹ᐟ³) ≈ 1.00

For i = 2:

PE(1, 4) = sin(1 / 10000²ᐟ³) ≈ 0.00

PE(1, 5) = cos(1 / 10000²ᐟ³) ≈ 1.00

So, the positional encoding vector for bank is:

[0.84, 0.54, 0.05, 1.00, 0.00, 1.00]

This is how positional encoding values are calculated for any word, based on its position in the sentence and the embedding dimension.
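The full calculation can be sketched as a small Python function that reproduces the vectors derived above:

```python
import math

# Sinusoidal positional encoding from "Attention Is All You Need":
#   PE(pos, 2i)   = sin(pos / 10000**(2i / d_model))
#   PE(pos, 2i+1) = cos(pos / 10000**(2i / d_model))
def positional_encoding(pos, d_model):
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe

# Reproduces the river/bank vectors worked out above (d_model = 6).
print([round(v, 2) for v in positional_encoding(0, 6)])
# [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
print([round(v, 2) for v in positional_encoding(1, 6)])
# [0.84, 0.54, 0.05, 1.0, 0.0, 1.0]
```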

 

OBSERVATIONS

A well-known heatmap visualization of positional encoding values (word positions on one axis, embedding dimensions on the other) helps make this concrete, so let us interpret it carefully. The key question it highlights is what happens when we add more sine–cosine pairs to increase the embedding dimension. A clear pattern emerges: with every additional sine–cosine pair, the frequency keeps decreasing.

Assume we have a sentence with 50 words, and each word has an embedding dimension of 128, which means d_{model} = 128. For these 50 words, we compute 50 positional encoding vectors, and each positional encoding vector is 128-dimensional. In other words, every word in the sentence is associated with a unique 128-dimensional positional encoding vector.

In the graph, we take these 50 positional encoding vectors and visualize them using a heatmap. Each row corresponds to one word position (50 rows in total), and each column corresponds to one embedding dimension (128 columns). The positional encoding vectors are stacked from top to bottom, starting with the first word in the sentence.

One immediate observation is the alternating white–blue pattern at the beginning. Here, white represents a value close to zero, and blue represents a value close to one. For the first word, we observe values like 0, 1, 0, 1, 0, 1, and so on, which matches the positional encoding we derived earlier.

A more interesting observation appears when we compare positional encodings across different words. Most of the visible changes occur in the lower dimensions. As we move to higher dimensions, the values change very slowly or almost not at all. If we ignore the lower dimensions, it becomes difficult to differentiate words using only the higher-dimensional components.

This happens because lower dimensions are generated using high-frequency sine–cosine functions, while higher dimensions use low-frequency sine–cosine functions. High-frequency curves change rapidly with position, whereas low-frequency curves require a much longer sequence length to show noticeable variation.

If instead of 50 words we had 100 words, then higher dimensions would also begin to show variation. This is because the x-axis (sequence length) would be longer, giving low-frequency curves enough space to change. As the number of words increases, more dimensions start to participate, and eventually all dimensions exhibit meaningful variation.

We can understand this behavior using a binary encoding analogy. In binary encoding, when we convert decimal numbers to binary, the lowest bit changes most frequently. The next bit changes every two steps, the next every four steps, then eight, and so on. For numbers from 0 to 15, if we use only 4 bits, the higher bits change much less frequently. If we used 8 bits, the highest bits might not change at all for small numbers.
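The analogy is visible by simply printing a few binary counts:

```python
# The lowest bit alternates every step, the next bit every two steps, the
# next every four -- mirroring how lower positional encoding dimensions vary
# quickly with position and higher dimensions vary slowly.
for n in range(8):
    print(format(n, "04b"))
```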

Positional encoding follows a similar idea, except it operates in a continuous value domain instead of a discrete binary domain. The sine–cosine functions act like binary bits: lower dimensions behave like lower bits with high frequency, and higher dimensions behave like higher bits with low frequency.

In this sense, positional encoding can be viewed as a continuous analog of binary encoding, implemented using sine and cosine functions. This design allows the model to represent positional information smoothly and consistently across multiple scales, making it effective for modeling long sequences.

 

SOLUTION

We initially proposed a simple positional solution, but it suffered from three major problems: it was unbounded, it used discrete values, and it could not capture relative positioning. To overcome these limitations, we introduced sine–cosine–based positional encoding. Let us now understand how sine and cosine help in capturing relative position information.

Assume we have a sentence with 50 words. This means we generate 50 positional encoding vectors, and each vector has 128 dimensions. Effectively, we are working in a 128-dimensional space that contains 50 vectors—one vector for each word position.

Because we apply sine and cosine functions with varying frequencies, the resulting positional encodings exhibit an important and interesting property. These 50 vectors are structured such that relative shifts between positions can be represented using linear transformations.

For example, suppose we take the positional encoding vector at position 10, denoted as v₁₀. If we apply a particular linear transformation (a matrix) to v₁₀, we obtain v₂₀. If we apply the same matrix to v₃₀, we get v₄₀. Applying it again to v₄₀ gives v₅₀. This indicates that this specific matrix represents a fixed positional shift of 10 positions.

This property is not limited to a single shift value. Consider another example. Let us take v₅. If we apply a certain matrix to v₅, we obtain v₁₀. Applying the same matrix to v₁₂ gives v₁₇, and applying it to v₂₁ gives v₂₆. In this case, the matrix represents a positional shift of 5 positions.

In this system, for every relative distance (Δ), there exists a corresponding linear transformation that maps one positional encoding vector to another. Because of this structure, the model can move consistently across positions using linear operations.

As a result, sine–cosine positional encoding naturally enables the model to understand relative positioning. The model does not rely only on absolute position indices; instead, it learns how positions relate to one another through these structured transformations. This is why sine–cosine positional encoding is effective in capturing relative positional information within the Transformer architecture.
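This property can be sketched for a single sine–cosine pair (the frequency `w` and shift `k` below are arbitrary illustrative values). By the angle-addition identities, one fixed 2×2 matrix performs the same positional shift at every position:

```python
import math
import numpy as np

def pair(p, w):
    # The sin-cos pair for position p at frequency w.
    return np.array([math.sin(w * p), math.cos(w * p)])

def shift_matrix(k, w):
    # Depends only on the shift k, not on the position p:
    # [sin(a+b), cos(a+b)] = [[cos b, sin b], [-sin b, cos b]] @ [sin a, cos a].
    b = w * k
    return np.array([[ math.cos(b), math.sin(b)],
                     [-math.sin(b), math.cos(b)]])

w, k = 0.1, 10
M = shift_matrix(k, w)
for p in (10, 30, 40):
    assert np.allclose(M @ pair(p, w), pair(p + k, w))
print("one matrix shifts every position by", k)
```

The full positional encoding stacks many such pairs at different frequencies, so a block-diagonal matrix of these rotations realizes any relative shift Δ across the whole vector.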

