How LSTMs Work: A Deep Dive into Gates and Information Flow
- Aryan
- Feb 4
- 10 min read
The Gates: Controlling the Flow in LSTMs

The sophisticated architecture of an LSTM revolves around two main tasks: updating the cell state and calculating the hidden state. This precise control over information flow is accomplished through a mechanism known as "gates."
The architecture is fundamentally divided into three distinct gates, each serving a critical purpose in the network's memory:
The Forget Gate: Its role is to decide what information is no longer relevant and should be removed from the current cell state.
The Input Gate: Conversely, this gate determines what new information is significant enough to be added to the cell state.
Note: By combining the filtering action of the Forget Gate and the additive action of the Input Gate, we successfully update our cell state.
The Output Gate: Finally, this gate’s job is to calculate the hidden state for the current timestamp (which is then passed on to the next timestamp) based on our newly updated cell state.
What Are Cₜ and hₜ?
A Mathematical Perspective

Now, let's look "under the hood" to see what these hidden states, cell states, and gate operations actually look like mathematically. Specifically, we need to understand what those yellow units in the diagrams represent.
1. The State Vectors (Cₜ and hₜ)
First, we need to define what the Cell State (Cₜ) and Hidden State (hₜ) actually are. Mathematically, both of these are vectors (collections of numbers).
There is a strict architectural rule in LSTMs regarding their shape:
The dimensions of Cₜ and hₜ are always equal.
If your hidden state is a vector of size 512, your cell state will also be a vector of size 512. This synchronization is critical for the LSTM to function.
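If you want to verify this rule for yourself, here is a minimal sketch using PyTorch (purely an independent check, not something the rest of the article depends on): whatever hidden size you pick, the hidden state and cell state come back with identical shapes.

```python
import torch
import torch.nn as nn

# A single-layer LSTM with hidden size 512 and an arbitrary input size of 4.
lstm = nn.LSTM(input_size=4, hidden_size=512, batch_first=True)

x = torch.randn(1, 10, 4)        # (batch=1, sequence length=10, input size=4)
output, (h_n, c_n) = lstm(x)

print(h_n.shape)                 # torch.Size([1, 1, 512]) -> hidden state
print(c_n.shape)                 # torch.Size([1, 1, 512]) -> cell state, same size
```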
2. The Input Vector (xₜ)
The input at the current timestamp, denoted as xₜ, is also a vector. However, unlike the relationship between Cₜ and hₜ, there is no restriction stating xₜ must have the same dimension as the states.
To understand how xₜ is calculated and fed into the network, let's look at a practical example using a Sentiment Analysis problem. Imagine we have the following dataset:
| Text | Sentiment |
| --- | --- |
| Cat mat rat | 0 |
| Cat rat rat | 0 |
| Mat mat cat | 1 |
Since LSTMs cannot process raw text, we must use vectorization techniques to convert these words into numbers. A common method is One-Hot Encoding.
The Vectorization Process:
Identify Unique Words: In our data, we have three unique words: cat, mat, and rat.
Create Vectors: We can represent each word as a 3-dimensional vector:
Cat → [1, 0, 0]
Mat → [0, 1, 0]
Rat → [0, 0, 1]
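Here is a minimal NumPy sketch of that vectorization step (the function name one_hot and the lower-cased vocabulary are just illustrative choices):

```python
import numpy as np

# Vocabulary built from the unique words in our toy dataset.
vocab = ["cat", "mat", "rat"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word: str) -> np.ndarray:
    """Return the 3-dimensional one-hot vector for a word in the vocabulary."""
    vector = np.zeros(len(vocab))
    vector[word_to_index[word]] = 1.0
    return vector

sentence = "cat mat rat".split()
x_sequence = [one_hot(w) for w in sentence]
# x_sequence -> [array([1., 0., 0.]), array([0., 1., 0.]), array([0., 0., 1.])]
```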
The Timestamp Flow:
If we take our first sentence, "Cat mat rat," the input to the LSTM flows word by word, creating a distinct timestamp for each input:
Timestamp 1: We send "Cat" [1, 0, 0]. The LSTM processes this and outputs C₁ and h₁.
Timestamp 2: We send "Mat" [0, 1, 0] along with the previous C₁ and h₁. The LSTM outputs C₂ and h₂.
Timestamp 3: We send "Rat" [0, 0, 1] along with C₂ and h₂. The LSTM outputs C₃ and h₃.
Key Takeaway:
While we used One-Hot Encoding here, xₜ can be generated using various vectorization techniques (like Word2Vec or GloVe). Regardless of the technique, xₜ remains a vector of numbers. The critical thing to remember is that while Cₜ and hₜ must match in size, the dimension of the input xₜ is independent.
Understanding the Components: fₜ, iₜ, oₜ, and C̃ₜ

Now that we understand the states and inputs, let’s identify the specific vectors that operate within the gates. These are the internal mechanisms that drive the LSTM's decision-making process.
fₜ (Forget Gate Vector): Located within the forget gate, this vector determines which information to discard.
iₜ (Input Gate Vector): Found in the input gate, this vector decides which values we will update.
C̃ₜ (Candidate Cell State): Also found in the input gate, this vector contains the new candidate values that could be added to the state.
oₜ (Output Gate Vector): Located in the output gate, this vector controls what part of the cell state makes it to the output.
A Critical Note on Dimensions:
Just like the cell state (Cₜ) and hidden state (hₜ), all four of these components—fₜ, iₜ, oₜ, and C̃ₜ—are vectors.
More importantly, they share the exact same shape. If your hidden state size is 512, then fₜ, iₜ, oₜ, and C̃ₜ will all be vectors of size 512. This uniformity is essential for the element-wise operations (like addition and multiplication) that occur within the cell.
Pointwise Operations
In LSTM diagrams, you will frequently see symbols like +, ×, and tanh. These represent pointwise (element-wise) operations: + and × combine two vectors element by element, while tanh is applied to each element of a single vector independently.
There are three primary pointwise operations we need to understand:
Pointwise Multiplication (×):
This requires two vectors of the exact same dimension.
Example: Let's say we have a previous cell state vector Cₜ₋₁ = [4, 5, 6] and a second vector fₜ = [1, 2, 3] (values chosen purely to keep the arithmetic easy to follow; a real forget vector only contains values between 0 and 1).
When we apply pointwise multiplication (Cₜ₋₁ × fₜ), we multiply the corresponding elements:
[4×1, 5×2, 6×3] = [4, 10, 18]
Pointwise Addition (+):
Similar to multiplication, this adds the corresponding elements of two vectors.
Example: Using the same vectors as above:
[4+1, 5+2, 6+3] = [5, 7, 9]
Pointwise Tanh:
This operation applies the hyperbolic tangent function to every single number in the vector independently.
Example: If we have a vector [0.2, 0.4, 0.6] and apply pointwise tanh, the function calculates the value for 0.2, then 0.4, and so on.
The result would be a new vector of compressed values, approximately: [0.20, 0.38, 0.54].
Key Rule: Pointwise operations always occur between vectors of the same shape/dimension, and the output is always a vector of that same size.
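A quick NumPy sketch of all three operations, using the same numbers as above:

```python
import numpy as np

c_prev = np.array([4.0, 5.0, 6.0])   # previous cell state Cₜ₋₁
f_t    = np.array([1.0, 2.0, 3.0])   # second vector (illustrative values only)

print(c_prev * f_t)              # pointwise multiplication -> [ 4. 10. 18.]
print(c_prev + f_t)              # pointwise addition       -> [5. 7. 9.]
print(np.tanh([0.2, 0.4, 0.6]))  # pointwise tanh           -> [0.197 0.380 0.537]
```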
Neural Network Layers (The Yellow Boxes)

In standard LSTM diagrams, the yellow boxes represent Neural Network Layers.
If you recall from Artificial Neural Networks (ANNs), a network consists of an input layer, hidden layers, and an output layer. The "yellow boxes" in an LSTM are functionally identical to the hidden layers you have studied in ANNs.
Structure: Each yellow box is a collection of nodes (neurons).
Activations: The label inside the box (e.g., σ for Sigmoid or tanh) tells us which activation function is applied to the nodes in that specific layer. For instance, the yellow box in the "forget gate" is simply a neural network layer where every node uses the Sigmoid activation function.
Hyperparameters and Consistency:
The number of nodes in these layers is flexible—it is a hyperparameter that you, the architect, decide. However, once you choose a size (e.g., 512 nodes), you must be consistent.
If you set the hidden size to 4, then every yellow box in that LSTM cell will have exactly 4 nodes.
This consistency ensures that the vectors produced by these layers (fₜ, iₜ, oₜ, C̃ₜ) all have the same dimension, allowing the pointwise operations we discussed above to work correctly.
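As a rough sketch of what that consistency implies for the parameters (assuming 3 hidden units and an input size of 4, matching the example used later), every yellow box owns a weight matrix of exactly the same shape:

```python
import numpy as np

hidden_size = 3   # the hyperparameter you choose: nodes per yellow box
input_size  = 4   # dimension of xₜ (independent of hidden_size)

# Each yellow box is a dense layer acting on the concatenation [hₜ₋₁, xₜ],
# so every one of them holds a weight matrix of the same shape.
layer_shapes = {
    gate: (hidden_size, hidden_size + input_size)   # (nodes, concatenated input)
    for gate in ["forget", "input", "candidate", "output"]
}
print(layer_shapes)
# {'forget': (3, 7), 'input': (3, 7), 'candidate': (3, 7), 'output': (3, 7)}
```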
The Forget Gate: Filtering Long-Term Memory

The first major operation in an LSTM is the Forget Gate. Its primary purpose is to decide what information from the previous cell state (Cₜ₋₁) is no longer relevant for the current context and should be removed.
This process happens in two distinct stages:
Calculating the Forget Vector (fₜ): A neural network layer decides what to forget.
Applying the Filter: We perform a pointwise operation to actually remove that information from the cell state.
1. Calculating fₜ (The Neural Network Layer)

To decide what to keep, the Forget Gate looks at two specific inputs:
The previous hidden state (hₜ₋₁).
The current input vector (xₜ).
A Mathematical Example:
Let's break down the dimensions to see how this works under the hood. Suppose we choose the following hyperparameters:
Number of nodes (hidden units): 3
Input feature size (xₜ): 4
This implies that our hidden state vectors (hₜ₋₁) and cell state vectors (Cₜ₋₁) must also have a dimension of 3.
The Matrix Operation:
When the inputs enter the gate, we concatenate the previous hidden state and the current input.
Concatenation: We join hₜ₋₁ (size 3) and xₜ (size 4) to create a combined vector of size 7.
Weights W_f: Since we have 7 inputs flowing into 3 neurons, our weight matrix will have a shape of 3 × 7 (21 total weights).
Bias b_f: We also add a bias vector of size 3 × 1.
We perform matrix multiplication (dot product) between the weights and the concatenated input, add the bias, and pass the result through a Sigmoid activation function:
fₜ = σ(W_f ⋅ [hₜ₋₁, xₜ] + b_f)
The result, fₜ, is a vector of size 3 × 1. This matches the dimension of our hidden units.
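Here is a minimal NumPy sketch of that computation under the same assumptions (3 hidden units, input size 4, randomly initialized weights standing in for learned ones):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 3, 4

rng = np.random.default_rng(0)
W_f = rng.normal(size=(hidden_size, hidden_size + input_size))  # 3 x 7
b_f = np.zeros(hidden_size)                                     # size 3

h_prev = rng.normal(size=hidden_size)   # hₜ₋₁, size 3
x_t    = rng.normal(size=input_size)    # xₜ,   size 4

concat = np.concatenate([h_prev, x_t])  # size 7
f_t = sigmoid(W_f @ concat + b_f)       # fₜ = σ(W_f · [hₜ₋₁, xₜ] + b_f)
print(f_t.shape)                        # (3,) -- one value per hidden unit, each in (0, 1)
```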
2. The Pointwise Operation (The Actual "Forgetting")
Now that we have the forget vector fₜ, we move to the second stage: updating the cell state.
We take the previous cell state (Cₜ₋₁) and perform pointwise multiplication with fₜ:
C_new = fₜ × Cₜ₋₁
The Intuition: Why call it a "Gate"?
The term "gate" is used because the sigmoid function outputs values between 0 and 1. This allows fₜ to act like a physical valve or gate that regulates the flow of information.
Let's look at a concrete example with actual numbers:
Assume the previous cell state Cₜ₋₁ = [4, 5, 6].
Scenario A: Partial Forgetting
If our network calculates fₜ = [0.5, 0.5, 0.5], the math looks like this:
[0.5, 0.5, 0.5] × [4, 5, 6] = [2, 2.5, 3]
Result: We have successfully "forgotten" half of the information/magnitude from the long-term memory.
Scenario B: Remembering Everything
If fₜ = [1, 1, 1]:
[1, 1, 1] × [4, 5, 6] = [4, 5, 6]
Result: The gate is fully open; no information is lost.
Scenario C: Forgetting Everything
If fₜ = [0, 0, 0]:
[0, 0, 0] × [4, 5, 6] = [0, 0, 0]
Result: The gate is fully closed; all long-term context is wiped out.
In summary, the Forget Gate (fₜ) empowers the neural network to intelligently decide—based on the current input (xₜ) and past context (hₜ₋₁)—exactly how much of the old memory should be preserved.
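The three scenarios above reduce to a single pointwise multiplication; a quick sketch:

```python
import numpy as np

c_prev = np.array([4.0, 5.0, 6.0])       # previous cell state Cₜ₋₁

for f_t in (np.array([0.5, 0.5, 0.5]),   # partial forgetting
            np.array([1.0, 1.0, 1.0]),   # remember everything
            np.array([0.0, 0.0, 0.0])):  # forget everything
    print(f_t * c_prev)
# [2.  2.5 3. ]
# [4. 5. 6.]
# [0. 0. 0.]
```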
The Input Gate: Adding New Information

While the Forget Gate decides what to remove, the Input Gate has the opposite job: it decides what new, relevant information should be added to the cell state (Cₜ).
This process occurs in three distinct stages:
Creation: Calculating a "Candidate" vector containing potential new values.
Selection: Calculating an "Input" vector to decide which of those candidates are worth keeping.
Update: Merging everything to create the final cell state.
Stage 1: The Candidate Cell State (C̃ₜ)

First, we need to create a vector of new information based on the current context. This is handled by a neural network layer with a Tanh activation function.
Inputs: It takes the previous hidden state (hₜ₋₁) and current input (xₜ).
Operation: Just like the forget gate, we concatenate these inputs and pass them through the network weights (W_c) and biases (b_c).
Why Tanh? Tanh outputs values between -1 and 1. This allows the network to propose adding positive or negative values to the state (increases or decreases), regulating the "direction" of the memory update.
C̃ₜ = tanh(W_c ⋅ [hₜ₋₁, xₜ] + b_c)
Dimensions:
Using our running example (Hidden units = 3, Input features = 4):
Concatenated Input: 7 × 1
Weights (W_c): 3 × 7
Result: A 3 × 1 vector of new candidate values.
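A sketch of Stage 1 with the same assumed dimensions (3 hidden units, input size 4, random weights as placeholders for learned ones):

```python
import numpy as np

hidden_size, input_size = 3, 4
rng = np.random.default_rng(1)

W_c = rng.normal(size=(hidden_size, hidden_size + input_size))  # 3 x 7
b_c = np.zeros(hidden_size)

h_prev = rng.normal(size=hidden_size)   # hₜ₋₁
x_t    = rng.normal(size=input_size)    # xₜ

# C̃ₜ = tanh(W_c · [hₜ₋₁, xₜ] + b_c) -- every value lies between -1 and 1
c_tilde = np.tanh(W_c @ np.concatenate([h_prev, x_t]) + b_c)
print(c_tilde.shape)   # (3,)
```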
Stage 2: The Input Gate Filter (iₜ)

We don't necessarily want to add all the new candidate information. Some of it might be noise. We need a filter.
This is handled by a parallel neural network layer, but this time with a Sigmoid activation function.
The Logic: The Sigmoid function outputs values between 0 and 1.
0: "Ignore this new candidate value."
1: "add this new candidate value entirely."
iₜ = σ(Wᵢ ⋅ [hₜ₋₁, xₜ] + bᵢ)
Dimensions:
The weights (Wᵢ) are also 3 × 7, resulting in a 3 × 1 vector.
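Stage 2 looks almost identical in code; only the activation function changes (again a sketch with placeholder weights):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 3, 4
rng = np.random.default_rng(2)

W_i = rng.normal(size=(hidden_size, hidden_size + input_size))  # 3 x 7
b_i = np.zeros(hidden_size)

h_prev = rng.normal(size=hidden_size)   # hₜ₋₁
x_t    = rng.normal(size=input_size)    # xₜ

# iₜ = σ(Wᵢ · [hₜ₋₁, xₜ] + bᵢ) -- each entry is between 0 and 1
i_t = sigmoid(W_i @ np.concatenate([h_prev, x_t]) + b_i)
print(i_t.shape)   # (3,)
```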
Stage 3: The Final Cell State Update

Now we combine the results from the Forget Gate and the Input Gate to calculate the true Cell State (Cₜ). This is the moment the memory is actually updated.
Filter the New Info: We perform pointwise multiplication on the output of Stage 1 and Stage 2 (iₜ × C̃ₜ). This keeps only the important new information.
Apply the Forget Gate: We take the previous cell state filtered by the forget gate (fₜ × Cₜ₋₁).
Add Them Together: We use pointwise addition to combine the old memory with the new memory.
The Final Equation:
Cₜ = (fₜ × Cₜ₋₁) + (iₜ × C̃ₜ)
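Putting the two halves together, the actual update is a single line of pointwise arithmetic (toy values below, chosen only to show the shapes):

```python
import numpy as np

# Toy values for one update step (all four vectors share the same size).
c_prev  = np.array([ 4.0,  5.0, 6.0])   # Cₜ₋₁
f_t     = np.array([ 1.0,  0.5, 0.0])   # forget gate output
i_t     = np.array([ 0.0,  1.0, 1.0])   # input gate output
c_tilde = np.array([ 0.9, -0.2, 0.7])   # candidate values

c_t = f_t * c_prev + i_t * c_tilde      # Cₜ = (fₜ × Cₜ₋₁) + (iₜ × C̃ₜ)
print(c_t)                              # [4.  2.3 0.7]
```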
Intuition: Solving the Vanishing Gradient Problem
One of the biggest issues in standard RNNs is the "Vanishing Gradient" problem: over long sequences, gradients (and with them, long-range information) decay because they are repeatedly multiplied by factors smaller than 1 as they flow back through time.
LSTMs solve this through the additive nature of the cell state update equation shown above.
Example:
Imagine we have a critical piece of information in the cell state Cₜ₋₁ = [4, 5, 6].
If the Forget Gate is set to [1, 1, 1] (Keep everything).
And the Input Gate is set to [0, 0, 0] (Add nothing).
The math becomes:
Cₜ = ([1, 1, 1] × [4, 5, 6]) + ([0, 0, 0] × C̃ₜ)
Cₜ = [4, 5, 6]
The information is passed to the next timestamp unchanged. The LSTM acts like a protected conveyor belt; if new information isn't useful, the network can choose to ignore it completely and carry the old memory forward without degradation. This ability to maintain long-term context is the core power of the LSTM architecture.
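A tiny sketch of that "conveyor belt" behaviour: with the gates held at these extremes, the cell state survives an arbitrary number of steps without degradation.

```python
import numpy as np

c   = np.array([4.0, 5.0, 6.0])      # important long-term information
f_t = np.array([1.0, 1.0, 1.0])      # keep everything
i_t = np.array([0.0, 0.0, 0.0])      # add nothing

for _ in range(100):                 # 100 timestamps later...
    c_tilde = np.random.randn(3)     # whatever new candidates appear
    c = f_t * c + i_t * c_tilde      # ...the cell state is untouched

print(c)   # [4. 5. 6.]
```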
The Output Gate: Calculating the Hidden State

The final step in the LSTM cycle is the Output Gate. Its job is to calculate the Hidden State (hₜ) for the current timestamp. This hₜ serves two purposes: it is the output prediction for this moment, and it will be passed forward to the next timestamp.
This process relies heavily on our newly updated Cell State (Cₜ). However, we don't just output the cell state directly; we filter it first. This happens in two main steps.
Step 1: Calculating the Output Filter (oₜ)
First, we need to decide which parts of the cell state we want to output. We do this using a neural network layer with a Sigmoid activation function (just like in the previous gates).
Inputs: The previous hidden state (hₜ₋₁) and the current input (xₜ).
The Operation: We filter the context to determine relevance.
We concatenate hₜ₋₁ (size 3) and xₜ (size 4) to get a vector of size 7.
We multiply this by the weight matrix Wₒ (size 3 × 7) and add the bias bₒ (size 3 × 1).
The result passes through the Sigmoid function to get oₜ.
oₜ = σ(Wₒ ⋅ [hₜ₋₁, xₜ] + bₒ)
Dimensions:
Just like the other gates, since we have 3 hidden units, the resulting vector oₜ will be of size 3 × 1.
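Step 1 in code, under the same assumptions as before (3 hidden units, input size 4, placeholder weights):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 3, 4
rng = np.random.default_rng(3)

W_o = rng.normal(size=(hidden_size, hidden_size + input_size))  # 3 x 7
b_o = np.zeros(hidden_size)

h_prev = rng.normal(size=hidden_size)   # hₜ₋₁
x_t    = rng.normal(size=input_size)    # xₜ

# oₜ = σ(Wₒ · [hₜ₋₁, xₜ] + bₒ)
o_t = sigmoid(W_o @ np.concatenate([h_prev, x_t]) + b_o)
print(o_t.shape)   # (3,)
```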
Step 2: Calculating the Hidden State (hₜ)
Now that we have our filter (oₜ) and our updated cell state (Cₜ), we can calculate the final output.
Normalize the Cell State: We push the current cell state (Cₜ) through a Tanh function. This squashes the values to be between -1 and 1.
Apply the Filter: We perform pointwise multiplication between our output gate vector (oₜ) and the normalized cell state.
hₜ = oₜ × tanh(Cₜ)
The Result:
oₜ is a 3 × 1 vector.
tanh(Cₜ) is a 3 × 1 vector.
The final result, hₜ, is a 3 × 1 vector.
This vector is now ready to be sent to the next timestamp as hₜ₋₁ and used for any immediate predictions (like classifying the sentiment of the sentence so far).
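Step 2 is one more pointwise operation (toy values reused from the cell-state sketch earlier):

```python
import numpy as np

o_t = np.array([0.9, 0.1, 0.5])   # output gate vector (3 x 1)
c_t = np.array([4.0, 2.3, 0.7])   # updated cell state Cₜ (3 x 1)

h_t = o_t * np.tanh(c_t)          # hₜ = oₜ × tanh(Cₜ)
print(h_t)                        # ~[0.90, 0.10, 0.30] -- also 3 x 1
```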
The Big Picture
Now that we have covered all three components, we can see the full flow of the LSTM architecture:
The Forget Gate (fₜ): Decides what information to remove from the long-term memory (Cₜ₋₁).
The Input Gate (iₜ & C̃ₜ): Decides what new information to add to the cell state.
The Output Gate (oₜ): Decides what the next hidden state (hₜ) should be, based on the updated memory.
This "Gating" mechanism is exactly what allows LSTMs to handle long sequences of data without losing context, solving the vanishing gradient problem found in standard RNNs.