What is a GRU? Gated Recurrent Units Explained (Architecture & Math)
- Aryan

- Feb 6
- 12 min read
What is a GRU?

Gated Recurrent Units (GRUs) are a specialized architecture within the Recurrent Neural Network (RNN) family, designed specifically to process sequential data. To understand why they exist, we first need to look at the evolution from simple RNNs to LSTMs.
The Problem with Simple RNNs
When we study simple RNNs, we see that while they are effective for short sequences, they have a major flaw when dealing with longer data streams. As the sequence grows, RNNs suffer from the vanishing and exploding gradient problem.
Because of this, simple RNNs fail to retain long-term context. They "forget" information from the beginning of a sequence by the time they reach the end. This was a critical bottleneck that needed a solution.
The LSTM Solution (and its Cost)
The Long Short-Term Memory (LSTM) network, introduced in 1997, was designed to solve this specific problem. With a more complex internal architecture, LSTMs can retain both long-term and short-term context, effectively mitigating the vanishing gradient issue.
However, this capability comes at a price. LSTMs utilize a complex structure involving three distinct gates. This complexity results in a large number of parameters, which can make training on massive datasets computationally expensive and time-consuming.
Enter the GRU (2014)
So, why do we need GRUs if LSTMs already work?
Introduced in 2014, the GRU was developed to address the computational heaviness of LSTMs. The primary motivation behind the GRU is efficiency. It offers a simpler architecture compared to the LSTM:
LSTM: Uses 3 gates.
GRU: Uses only 2 gates.
Because of this streamlined design, GRUs have fewer parameters to update. This directly translates to faster training times, making them highly attractive for large datasets or resource-constrained environments.
Performance: GRU vs. LSTM
The best part about the GRU is that despite its simpler architecture, its performance is often comparable to that of the LSTM.
Empirical Evidence: Studies show that on certain tasks, GRUs can actually outperform LSTMs. On others, LSTMs maintain the edge.
No Clear Winner: There is no "silver bullet." The GRU is not strictly better in every sense; it is simply a more efficient alternative that balances performance with training speed.
When building deep learning models, we often treat the choice between LSTM and GRU as a hyperparameter to be tuned. We should test both to see which fits the specific problem better. However, because GRUs provide a lightweight architecture with competitive performance, they are an essential concept for any Data Scientist to master.
The Core Idea Behind GRU
To understand the Gated Recurrent Unit (GRU), it helps to look back at the LSTM. The LSTM architecture was built on the idea of maintaining two separate states to track context: a cell state for long-term memory and a hidden state for short-term memory. It managed these using three distinct gates.
The GRU simplifies this concept significantly. The main idea behind the GRU is that we don't need two separate states to retain long-term and short-term context. Instead, the GRU merges them into a single hidden state.
Designed for efficiency, the GRU uses this single state to carry both short and long-term context simultaneously. To manipulate this state, it employs just two gates: the Reset Gate and the Update Gate.
The Setup

The Goal
The objective of the GRU architecture is straightforward. For any given timestamp t, the unit takes two inputs:
The previous hidden state (hₜ₋₁)
The current input (xₜ)
Using these inputs, the GRU calculates and outputs the current hidden state (hₜ).
Terminology & Vectors
Here are the key terms we use to describe the architecture. Mathematically, all of these represent vectors (ordered lists of numbers):
h: Hidden state
hₜ₋₁: Previous hidden state
hₜ: Current hidden state (at timestamp t)
xₜ: Current input
rₜ: Reset gate
zₜ: Update gate
h̃ₜ: Candidate hidden state
A Note on Dimensionality:
It is important to remember that these vectors can be of any dimension. However, there is a strict rule regarding their consistency: apart from the input vector xₜ, all other vectors (h, r, z, h̃) must share the same dimensionality. For example, if hₜ₋₁ is a 4-dimensional vector, then hₜ, rₜ, zₜ, and h̃ₜ must also be 4-dimensional. The input xₜ is the only exception and can have a different dimension.
The Architecture (The "Yellow Boxes")

In standard architecture diagrams, you will often see yellow boxes. These represent Fully Connected Neural Network layers.
Hyperparameters: The number of nodes in these layers is a hyperparameter, meaning we decide the count.
Activations:
Boxes labeled with σ represent a layer where all nodes use the Sigmoid activation function.
Boxes labeled with tanh represent a layer where all nodes use the Tanh activation function.
Connecting Layers to Dimensions:
There is a direct link between these layers and our vectors. The number of nodes in these layers is identical across the architecture (e.g., if one layer has 32 nodes, they all do). Crucially, the dimension of our vectors (hₜ, rₜ, etc.) is exactly equal to this number of nodes.
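For instance, in a framework such as TensorFlow/Keras (an assumption for illustration; the article itself is framework-agnostic), the `units` argument of a GRU layer is exactly this node count, and therefore the hidden-state dimension:

```python
# Minimal sketch, assuming TensorFlow/Keras is available.
import tensorflow as tf

gru = tf.keras.layers.GRU(units=4)   # every internal layer has 4 nodes,
                                     # so hₜ, rₜ, zₜ and h̃ₜ are all 4-dimensional
x = tf.random.normal((1, 3, 3))      # (batch, timesteps, input_dim): one sentence, 3 words, 3-dim vectors
h_t = gru(x)                         # the final hidden state hₜ
print(h_t.shape)                     # (1, 4)
```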
Pointwise Operations
Finally, the architecture relies on pointwise (or element-wise) operations. This simply means applying an operation to corresponding elements in two vectors.
For example, if we have two vectors A = [a, b, c] and B = [d, e, f]:
Pointwise Multiplication: [a·d, b·e, c·f]
Pointwise Addition: [a + d, b + e, c + f]
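In code, these are just element-wise array operations. A tiny NumPy illustration (not from the original article):

```python
import numpy as np

A = np.array([1.0, 2.0, 3.0])
B = np.array([4.0, 5.0, 6.0])

print(A * B)   # pointwise multiplication -> [ 4. 10. 18.]
print(A + B)   # pointwise addition       -> [5. 7. 9.]
```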
The Input (xₜ)
The term xₜ simply represents the input vector at the current timestamp t. To understand this better, let's look at a practical example using Sentiment Analysis.
Imagine we have a dataset consisting of three short reviews:
| Text | Review Sentiment |
| --- | --- |
| Cat mat rat | 1 (Positive) |
| Cat rat rat | 0 (Negative) |
| Mat rat mat | 1 (Positive) |
From Words to Numbers
A GRU (or any neural network) cannot understand raw text; it only understands numbers. Therefore, we must convert our text into numerical vectors using a process called Text Vectorization. Common techniques include One-Hot Encoding (OHE), Bag of Words (BoW), or Word2Vec.
For this example, let's use One-Hot Encoding.
First, we count the total unique words in our vocabulary. Here, we have 3 unique words: Cat, Mat, Rat.
We can represent each word as a 3-dimensional vector:
Cat: [1, 0, 0]
Mat: [0, 1, 0]
Rat: [0, 0, 1]
Using this mapping, we can translate our sentences into a sequence of vectors:
Sentence 1 ("Cat mat rat"):
t = 1: [1, 0, 0]
t = 2: [0, 1, 0]
t = 3: [0, 0, 1]
Sentence 2 ("Cat rat rat"):
t = 1: [1, 0, 0]
t = 2: [0, 0, 1]
t = 3: [0, 0, 1]
Sentence 3 ("Mat rat mat"):
t = 1: [0, 1, 0]
t = 2: [0, 0, 1]
t = 3: [0, 1, 0]
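Here is a short sketch of how these one-hot sequences could be built programmatically (hypothetical helper code using NumPy, not part of the original example):

```python
import numpy as np

vocab = ["cat", "mat", "rat"]                                   # 3 unique words
one_hot = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}

sentence = "Cat mat rat"                                        # Sentence 1
sequence = [one_hot[word] for word in sentence.lower().split()]

for t, x_t in enumerate(sequence, start=1):
    print(f"t = {t}: {x_t}")                                    # xₜ fed to the GRU at timestamp t
# t = 1: [1. 0. 0.]
# t = 2: [0. 1. 0.]
# t = 3: [0. 0. 1.]
```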
Understanding Timestamps
When processing sequential data, we pass one "unit" (in this case, one word) at a time.
Processing "Cat" happens at timestamp t = 1.
Processing "mat" happens at timestamp t = 2.
Processing "rat" happens at timestamp t = 3.
So, for a single sentence with three words, the GRU will run through three distinct timestamps. This process is repeated for every sentence in our dataset. This is why xₜ is always a vector—it is the numerical representation of the specific word being processed at that moment.
Architecture Overview
The ultimate goal of the GRU at any given timestamp is simple: take the previous hidden state (hₜ₋₁) and the current input (xₜ), and use them to calculate the new current hidden state (hₜ).
We can break this process down into four distinct steps:
Calculate the Reset Gate (rₜ): Determines how much of the past information needs to be forgotten.
Calculate the Candidate Hidden State (h̃ₜ): Creates a temporary new state containing relevant information from the current input and the past.
Calculate the Update Gate (zₜ): Decides what information will be carried forward to the future.
Calculate the Current Hidden State (hₜ): Combines the results to produce the final output for this timestamp.
What Exactly is the Hidden State (hₜ)?

To understand the hidden state intuitively, let's move away from numbers for a moment and look at a story. The hidden state functions as the memory of the system. When we feed a sequence into a GRU (or LSTM), its job is to maintain the context of that sequence over time.
Imagine we are processing a saga about two rival kingdoms. We want to query the GRU at the end: Is this a happy or sad story? To answer this, the model must retain the context of every sentence passed so far.
The "King's Story" Analogy
Let's assume our hidden state vectors hₜ are 4-dimensional.
Reality: In a real neural network, these dimensions are a "black box"—we don't know exactly what each number represents.
Intuition: For this example, let's assign specific meanings to each dimension to see how the "memory" evolves:
Vector = [Power, Conflict, Tragedy, Revenge]
Let's trace how the vector evolves as we feed the story into the GRU timestamp by timestamp (t).
The Evolution of Context
Timestamp t = 1: "There was a king, very strong and powerful."
Context: Introduction of a strong leader.
Vector: [0.9, 0.0, 0.0, 0.0]
Interpretation: High Power, no conflict yet.
Timestamp t = 2: "There was an enemy king."
Context: A rival appears.
Vector: [1.0, 0.5, 0.0, 0.0]
Interpretation: Power stays high (two kings), Conflict begins to rise.
Timestamp t = 3: "Both had a war and the enemy killed the king."
Context: The protagonist dies.
Vector: [0.7, 0.7, 0.4, 0.0]
Interpretation: Power drops (king died), Conflict peaks (war), Tragedy enters the memory.
Timestamp t = 4: "The King had a son, Jr., who grew up to become very strong."
Context: A new strong leader emerges.
Vector: [0.9, 0.6, 0.3, 0.0]
Interpretation: Power rises again, Tragedy fades slightly as hope returns.
Timestamp t = 5: "He also attacked the enemy king but got killed."
Context: History repeats itself.
Vector: [0.8, 0.8, 0.6, 0.5]
Interpretation: Tragedy spikes again. We now see the seeds of Revenge appearing.
Timestamp t = 6: "Jr. had a son, Super Jr., who killed the enemy king and took revenge."
Context: The cycle ends.
Vector: [0.7, 0.8, 0.3, 1.0]
Interpretation: Revenge is fully realized (1.0). The story concludes with high conflict, some remaining tragedy, but a definitive resolution.
This demonstrates how a GRU maintains context. It doesn't just store the last word; it carries a weighted "summary" of the entire narrative in the hidden state vector.
The Mechanics: How the Transition Happens

Now that we understand what the hidden state represents, how does the GRU mathematically calculate it? How do we move from the old memory (hₜ₋₁) to the new memory (hₜ)?
The GRU manages this process in two distinct phases, using two gates to control the flow of information.
Phase 1: Creating the Candidate Hidden State (h̃ₜ)
Goal: Create a "draft" or "candidate" for the new memory based heavily on the current input (xₜ).
The Helper: The Reset Gate (rₜ).
The Logic: We take the previous memory (hₜ₋₁) and the current input (xₜ). The Reset Gate decides how much of the past is relevant to the current input. If the new input represents a massive topic shift, the Reset Gate might tell the model to "forget" much of the past when creating this candidate state.
Phase 2: Finalizing the Current Hidden State (hₜ)
Goal: Decide the final memory vector for this timestamp.
The Helper: The Update Gate (zₜ).
The Logic: This is a balancing act. The model has two choices: keep the old memory (hₜ₋₁) or accept the new candidate memory (h̃ₜ).
If the current input is critical (e.g., a plot twist), the Update Gate gives more weight to the new candidate.
If the current input is noise (e.g., a filler word like "the"), the Update Gate preserves the old memory.
Reset Gate (rₜ): Helps us get from hₜ₋₁ → h̃ₜ (Focus: Short-term relevance).
Update Gate (zₜ): Helps us get from h̃ₜ → hₜ (Focus: Long-term balance).
Step 1: Calculating the Reset Gate (rₜ)
The first operation in the GRU architecture is the Reset Gate.

The Concept
The Reset Gate (rₜ) is a vector with the same dimensionality as the hidden state (hₜ₋₁).
Values: Each number in this vector is between 0 and 1 (due to the Sigmoid activation).
Function: It acts as a filter for the past.
Close to 0: "Shut the gate." Ignore this part of the past memory.
Close to 1: "Open the gate." Keep this part of the past memory.
Intuition (The Story):
Imagine our "King" story. We have read three sentences; the King has died (hₜ₋₁). The new sentence (xₜ) introduces the King's son.
The Reset Gate looks at this new input and decides: "Okay, the old King is dead. We need to reset the 'Power' and 'Conflict' context associated with him to make room for the son, but we must retain the 'Revenge' context."
Example:
Past Memory hₜ₋₁: [0.6, 0.6, 0.7, 0.1] (Power, Conflict, Tragedy, Revenge)
Reset Gate rₜ: [0.8, 0.2, 0.1, 0.9]
Interpretation: We keep 80% of the generic power context, but we reset Conflict (0.2) and Tragedy (0.1) to start fresh for the new character. However, we keep 90% of the Revenge context (0.9) because that plot point continues.
The Math
To calculate rₜ, we use the previous hidden state and the current input.
rₜ = σ(Wᵣ ⋅ [hₜ₋₁, xₜ] + bᵣ)
Concatenation: We join hₜ₋₁ (e.g., 4 dimensions) and xₜ (e.g., 3 dimensions) into a single vector.
Linear Transformation: We multiply this by a weight matrix Wᵣ (which contains learned parameters) and add a bias bᵣ.
Note on dimensions: If h is size 4 and x is size 3, our concatenated vector is size 7. The weight matrix will transform this back to size 4 to match our hidden state.
Activation: We apply the Sigmoid function (σ) to squash all values between 0 and 1.
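A minimal NumPy sketch of this calculation (the weights below are random placeholders; in a trained model they are learned):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

h_prev = np.array([0.6, 0.6, 0.7, 0.1])     # hₜ₋₁ (4-dimensional)
x_t    = np.array([1.0, 0.0, 0.0])          # xₜ   (3-dimensional, e.g. one-hot "Cat")

concat = np.concatenate([h_prev, x_t])      # [hₜ₋₁, xₜ] -> 7-dimensional

rng = np.random.default_rng(0)
W_r = rng.normal(size=(4, 7))               # maps the 7 dims back to the hidden size of 4
b_r = np.zeros(4)

r_t = sigmoid(W_r @ concat + b_r)           # rₜ: four values between 0 and 1
print(r_t)
```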
Step 2: The Candidate Hidden State (h̃ₜ)
Now that we have our "Reset" instructions (rₜ), we create a proposal for the new memory. This is called the Candidate Hidden State.
1. Modulating the Past
First, we apply the reset gate to the past memory using pointwise multiplication:
hᵣₑₛₑₜ = rₜ ⊙ hₜ₋₁
Using our previous numbers:
[0.8, 0.2, 0.1, 0.9] ⊙ [0.6, 0.6, 0.7, 0.1] = [0.48, 0.12, 0.07, 0.09]
We have successfully faded out the irrelevant parts of the past (Conflict and Tragedy) while keeping the relevant parts.
2. Generating the Candidate
We take this "clean" past memory, mix it with the current input xₜ, and pass the combination through a Tanh layer. Tanh is used here because it outputs values between -1 and 1, which lets the model build richer feature representations (including negative values).
h̃ₜ = tanh(W_c ⋅ [rₜ ⊙ hₜ₋₁, xₜ] + b_c)
Let's assume the result of this calculation is our candidate vector:
h̃ₜ = [0.7, 0.2, 0.1, 0.2]
This vector represents the new memory proposed purely based on the current input and the relevant parts of the past.
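Continuing with the numbers from the text, here is a small NumPy sketch of this step (W_c and b_c are placeholder weights; only the reset multiplication reproduces the exact values above):

```python
import numpy as np

r_t    = np.array([0.8, 0.2, 0.1, 0.9])     # reset gate from the example
h_prev = np.array([0.6, 0.6, 0.7, 0.1])     # hₜ₋₁
x_t    = np.array([1.0, 0.0, 0.0])          # xₜ (illustrative one-hot input)

h_reset = r_t * h_prev                      # rₜ ⊙ hₜ₋₁
print(h_reset)                              # [0.48 0.12 0.07 0.09]

rng = np.random.default_rng(0)
W_c = rng.normal(size=(4, 7))               # placeholder weights; learned in practice
b_c = np.zeros(4)

h_candidate = np.tanh(W_c @ np.concatenate([h_reset, x_t]) + b_c)   # h̃ₜ, values in (-1, 1)
print(h_candidate)
```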
Step 3: The Update Gate (zₜ)
Before we finalize the memory, we need the Update Gate. This gate decides the balance: How much of the old memory should we keep, and how much of the new candidate should we accept?
The calculation is identical to the Reset Gate (using Sigmoid), but with its own unique set of weights (W_z):
zₜ = σ(W_z ⋅ [hₜ₋₁, xₜ] + b_z)
Let's assume our calculated zₜ is:
zₜ = [0.7, 0.7, 0.8, 0.2]
Step 4: The Final Calculation (hₜ)
Now we combine everything to produce the final hidden state for the current timestamp. We use the Update Gate zₜ to weigh the Old Memory (hₜ₋₁) against the New Candidate (h̃ₜ).
The Formula
hₜ = (1 − zₜ) ⊙ hₜ₋₁ + zₜ ⊙ h̃ₜ
If zₜ is close to 1: We value the Candidate (New) info more.
If zₜ is close to 0: We value the Past (Old) info more.
Numerical Walkthrough
Let's calculate the first dimension of our final vector to see exactly how this math works.
Past (hₜ₋₁): 0.6
Candidate (h̃ₜ): 0.7
Update Gate (zₜ): 0.7 (This means we want 70% of the new info, 30% of the old).
h_new = (1 − 0.7) × 0.6 + (0.7 × 0.7)
h_new = (0.3 × 0.6) + 0.49
h_new = 0.18 + 0.49
h_new = 0.67
Interpretation:
The model decided that for this specific feature, the new input was very important (z = 0.7). So, the value shifted from 0.6 (old) to 0.67 (closer to the new candidate of 0.7).
Conversely, look at the 4th dimension where zₜ = 0.2:
Past: 0.1
Candidate: 0.2
Update Gate: 0.2 (Low update score → keep the past).
h_new = (0.8 × 0.1) + (0.2 × 0.2) = 0.08 + 0.04 = 0.12
Here, the value barely changed from the old memory (0.1), because the gate said "don't update this much."
This dynamic balancing act happens for every dimension at every timestamp, allowing the GRU to carry context over long sequences efficiently.
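Putting all four steps together, here is a compact from-scratch GRU cell in NumPy (an illustrative sketch with randomly initialised weights, not a trained model), run over the one-hot sequence for "Cat mat rat" from earlier:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_cell(h_prev, x_t, W_r, b_r, W_z, b_z, W_c, b_c):
    """One GRU timestep: (hₜ₋₁, xₜ) -> hₜ."""
    concat = np.concatenate([h_prev, x_t])
    r_t = sigmoid(W_r @ concat + b_r)                                   # Step 1: reset gate
    h_cand = np.tanh(W_c @ np.concatenate([r_t * h_prev, x_t]) + b_c)   # Step 2: candidate state
    z_t = sigmoid(W_z @ concat + b_z)                                   # Step 3: update gate
    return (1.0 - z_t) * h_prev + z_t * h_cand                          # Step 4: final hidden state

input_dim, hidden_dim = 3, 4
rng = np.random.default_rng(42)
W_r = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim + input_dim))
W_z = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim + input_dim))
W_c = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim + input_dim))
b_r = b_z = b_c = np.zeros(hidden_dim)

# "Cat mat rat" as one-hot vectors, one word per timestamp
sentence = [np.array([1.0, 0.0, 0.0]),   # Cat
            np.array([0.0, 1.0, 0.0]),   # mat
            np.array([0.0, 0.0, 1.0])]   # rat

h_t = np.zeros(hidden_dim)               # initial hidden state (empty memory)
for t, x_t in enumerate(sentence, start=1):
    h_t = gru_cell(h_t, x_t, W_r, b_r, W_z, b_z, W_c, b_c)
    print(f"h_{t} = {np.round(h_t, 3)}")
```

With trained weights, the hidden state after the last word would typically be passed to a small output layer (e.g., a single sigmoid unit) to predict the review sentiment.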
LSTM vs. GRU (Comparison)
Now that we understand the internal mechanics of the GRU, let's compare it directly to its predecessor, the LSTM.
1. Architecture & Gates
LSTM: Utilizes three gates (Input, Forget, and Output).
GRU: Streamlines the process with just two gates (Reset and Update).
2. Memory Units
LSTM: Maintains two separate states: the Cell State (cₜ) for long-term internal memory and the Hidden State (hₜ) for output.
GRU: Simplifies this by merging both functions into a single Hidden State (hₜ), which captures both long-term and short-term context simultaneously.
3. Parameter Efficiency
Because the LSTM has an additional gate and a separate cell state, it requires significantly more parameters.
LSTM parameters: ≈ 4 × ((d × h) + (h × h) + h)
GRU parameters: ≈ 3 × ((d × h) + (h × h) + h)
(Where d is input size and h is hidden size).
Impact: The GRU has fewer parameters to update, which directly translates to faster training speeds and lower computational costs.
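Plugging example sizes into these formulas (a quick illustrative calculation; exact counts in specific frameworks can differ slightly depending on how biases are handled):

```python
def lstm_params(d, h):
    return 4 * ((d * h) + (h * h) + h)

def gru_params(d, h):
    return 3 * ((d * h) + (h * h) + h)

d, h = 100, 128                 # e.g. 100-dim input vectors, 128-dim hidden state
print(lstm_params(d, h))        # 117248
print(gru_params(d, h))         # 87936 -> roughly 25% fewer parameters
```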
4. Performance & Usage
Empirical Results: In many tasks, especially those with smaller datasets, GRUs perform comparably to LSTMs while training faster. However, for very complex tasks requiring long-range dependencies, LSTMs may still have a slight edge.
The Verdict: There is no clear winner. The choice often comes down to empirical testing (trial and error). However, due to their efficiency, GRUs are often the best starting point for new projects.


