
The Definitive Guide to Recurrent Neural Networks: Processing Sequential Data & Beyond

  • Writer: Aryan
  • Jan 25
  • 7 min read

Sequential Data and Recurrent Neural Networks (RNNs)

 

We have previously explored two fundamental neural network architectures: Artificial Neural Networks (ANNs), typically applied to tabular data, and Convolutional Neural Networks (CNNs), specialized for image data. Our focus now shifts to the third major type: the Recurrent Neural Network (RNN), which is specifically designed to handle sequential data.

 

What is Sequential Data?

 

Sequential data is any type of data where the order or sequence of elements is critical to its meaning or prediction.

Consider a simple example with an ANN: predicting a student's placement based on their IQ score and gender. If we swap the order of these two inputs (IQ then Gender, or Gender then IQ), the output of the ANN remains unchanged. This is characteristic of non-sequential data, where sequence is irrelevant.

However, many real-world datasets depend heavily on sequence:

  • Text (Natural Language): We understand a sentence word-by-word, where the context and semantic meaning are constructed by the sequence of words.

  • Time Series Data: Predicting a company's stock price or sales requires knowledge of past trends and values, as future events are sequentially dependent on the past.

  • Speech and Audio: Processing sounds and spoken language relies on the sequential nature of acoustic signals.

  • Biological Data: DNA and protein sequences are fundamentally sequential.

The Recurrent Neural Network (RNN) is a unique type of neural network structure built to process this kind of sequential information effectively. The development of RNNs was a major breakthrough, particularly revolutionizing the Natural Language Processing (NLP) domain.

 

Why Do We Need RNNs? The Limitations of ANNs

 

To understand the necessity of RNNs, let's look at the challenges we face when trying to apply a standard ANN to sequential data, such as a text classification task (e.g., classifying a sentence as positive or negative).

 

1. The Varying Input Size Problem

 

Textual data is inherently variable in length. One sentence might have 5 words, and the next might have 20.

  • When preparing text for an ANN, we first use vectorization (like One-Hot Encoding or Word Embeddings) to represent each word as a numerical vector.

  • For an ANN, the input layer size must be fixed. If a sentence has N words and the word vector size is V, the total input size is N * V.

  • If we try to feed sentences of different lengths into the same ANN, the network's architecture (specifically, the number of input neurons and weights) will not match the required input size, leading to an immediate shape-mismatch failure, as sketched below.
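A minimal sketch (assuming TensorFlow/Keras and NumPy; the layer sizes are arbitrary) of why an ANN with a fixed-size input layer cannot accept sentences of different lengths:

```python
# A minimal sketch: an ANN built for 3-word sentences (V = 5 features per word)
# cannot accept a 5-word sentence.
import numpy as np
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Dense

V = 5                                          # word-vector size
model = Sequential([
    Input(shape=(3 * V,)),                     # input layer fixed at 15 features
    Dense(8, activation='relu'),
    Dense(1, activation='sigmoid'),
])

three_word_sentence = np.random.rand(1, 3 * V)   # 15 features: matches the input layer
five_word_sentence = np.random.rand(1, 5 * V)    # 25 features: does not match

model.predict(three_word_sentence)    # works
# model.predict(five_word_sentence)   # raises a shape-mismatch error
```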

 

2. The Flawed "Zero Padding" Solution

 

A common, but problematic, workaround to the varying size problem is zero padding.

  • We determine the maximum sentence length (Nₘₐₓ) in our dataset.

  • Shorter sentences are then artificially extended by adding zero vectors (padding) until they all reach Nₘₐₓ words.

  • While this technically creates a fixed-size input for the ANN, it introduces significant issues:

    • Massive Computation: Imagine one-hot word vectors over a 10,000-word vocabulary (V = 10,000) and a maximum sentence length of Nₘₐₓ = 100. The input vector size becomes 100 * 10,000 = 1,000,000 features. For a short, five-word sentence, 95% of this input is useless zero padding, resulting in redundant and unnecessary computation.

    • Inflexibility during Inference: If a user provides an input sentence that is longer than the Nₘₐₓ we trained on (e.g., 200 words), our model, trained on a fixed length of 100, cannot process it without truncating it (see the sketch below).
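In practice, padding and truncation are usually done with a helper such as Keras's pad_sequences. Below is a brief sketch using integer word indices (the indices are hypothetical, and the exact import path can vary across Keras versions):

```python
# A brief sketch of zero padding with Keras's pad_sequences helper.
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Two sentences as lists of (hypothetical) word indices
sentences = [
    [4, 10, 7],               # 3 words
    [4, 10, 2, 7, 9, 1],      # 6 words
]

# Pad everything to the maximum length (6); zeros fill the gap.
padded = pad_sequences(sentences, maxlen=6, padding='post')
print(padded)
# [[ 4 10  7  0  0  0]
#  [ 4 10  2  7  9  1]]

# At inference time, a sentence longer than maxlen would simply be truncated,
# silently discarding words.
```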

 

3. Disregarding Sequence and Semantic Meaning (The Core Problem)

 

This is the most crucial limitation. ANNs, by design, are feed-forward and stateless.

  • An ANN takes the entire sentence (all the stacked word vectors) as a single, large input block at once.

  • It has no inherent memory or mechanism to process information sequentially or retain context from one word to the next.

  • In language, the order of words dictates the meaning ("Dog bites man" vs. "Man bites dog"). By treating the entire input as a static, flat feature vector, the ANN loses the vital semantic and contextual information conveyed by the sequence.

In summary, ANNs are incapable of handling the varying size and retaining the sequential dependency that is fundamental to understanding data like text, time series, or speech. This is precisely the gap that RNNs are built to fill, making them the superior choice for sequence-based data tasks.


DATA FOR RNN


When we work with RNNs, the input must follow a specific structure: (timesteps, input_features).

For example, consider a sentiment analysis task where the input is a text review and the output is either positive (1) or negative (0).

 

Example Dataset

Review               Sentiment
Movie was good       1
Movie was bad        0
Movie was not good   0

 Since the input consists of English words, the model needs numerical representations.

One simple approach is one-hot encoding.

In these three reviews, we have five unique words:

  • movie

  • was

  • good

  • bad

  • not

So our vocabulary size (built from this small corpus) = 5.

Each word can be represented as a 5-dimensional one-hot vector:

  • movie → [1 0 0 0 0]

  • was → [0 1 0 0 0]

  • good → [0 0 1 0 0]

  • bad → [0 0 0 1 0]

  • not → [0 0 0 0 1]

Now, each review becomes a sequence of these vectors.

Example:

  • Review 1: Movie was good

    → [[1 0 0 0 0], [0 1 0 0 0], [0 0 1 0 0]]

This review has shape → (3, 5)

  • 3 timesteps (three words)

  • 5 input features (five-dimensional encoding)

Similarly:

  • Review 2 → (3,5)

  • Review 3 → (4,5)

When using Keras, the model expects data in the form:

(batch_size, timesteps, input_features)

If we pass all three reviews together, then:

  • batch_size = 3

  • timesteps = max sentence length = 4 (shorter reviews are zero-padded to this length)

  • input_features = 5

So final tensor shape → (3, 4, 5)

This explains how RNNs expect data to be fed.
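As a concrete illustration, here is a small NumPy sketch (the ordering of words in the vocabulary is assumed) that builds exactly this (3, 4, 5) tensor from the three reviews:

```python
# Building the (batch_size, timesteps, input_features) tensor for the three reviews.
import numpy as np

vocab = ["movie", "was", "good", "bad", "not"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

reviews = ["movie was good", "movie was bad", "movie was not good"]
max_len = max(len(r.split()) for r in reviews)            # timesteps = 4

batch = np.zeros((len(reviews), max_len, len(vocab)))     # (3, 4, 5)
for i, review in enumerate(reviews):
    for t, word in enumerate(review.split()):
        batch[i, t] = one_hot[word]                       # unused timesteps stay zero-padded

print(batch.shape)   # (3, 4, 5)
```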


HOW RNN WORKS

 

Using the same sentiment analysis example:

Review               Sentiment
Movie was good       1
Movie was bad        0
Movie was not good   0

Each sentence is represented as a sequence of numerical vectors.

Let:

  • First sentence → x₁

  • Second sentence → x₂

  • Third sentence → x₃

Individual words inside a sentence can be denoted as:

  • x₁₁, x₁₂, x₁₃

  • x₂₁, x₂₂, x₂₃

  • x₃₁, x₃₂, x₃₃, x₃₄

 

RNN vs ANN: Two Core Differences

 

Although an RNN looks similar to a regular ANN (input → hidden → output), two fundamental differences make it unique:

 

1. How inputs are fed


ANNs receive the full input at once.

But RNNs receive inputs one timestep at a time.

For the sentence Movie was good:

  • At t = 1, input = x₁₁ (the first word)

  • At t = 2, input = x₁₂

  • At t = 3, input = x₁₃

Each timestep produces an output and updates the internal state.

 

2. Presence of a Feedback Connection (Recurrent State)

 

RNNs are not purely feed-forward.

The hidden layer has a loop — its output at time t-1 becomes input at time t.

This “memory” is what gives RNNs their power.

  • At t = 1, the state is usually initialized to zeros (or random values).

  • At t = 2, the hidden layer receives:

    • current input x₁₂

    • previous hidden state h₁ (from t = 1)

  • At t = 3, it receives: x₁₃ and h₂

This process continues for the entire sequence.

 

Network Structure and Trainable Parameters

 

Assume:

  • Input size = 5 (because each word is a 5-dim one-hot vector)

  • Hidden layer size = 3

  • Output size = 1 (sigmoid for sentiment)

Weights

  1. Input → Hidden weights (Wᵢ)

    Shape = 5 × 3 → 15 weights

  2. Hidden → Hidden recurrent weights (Wₕ)

    Shape = 3 × 3 → 9 weights

  3. Hidden → Output weights (Wₒ)

    Shape = 3 × 1 → 3 weights

Biases

  • Hidden layer bias → 3

  • Output layer bias → 1

Total biases = 4

 

Total Trainable Parameters

Total = 15 + 9 + 3 + 4 = 31

So this RNN architecture contains 31 trainable parameters.
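You can verify this count with a minimal Keras sketch (assuming TensorFlow/Keras; the layer sizes mirror the architecture above):

```python
# A minimal Keras model matching the architecture above, to confirm the count.
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import SimpleRNN, Dense

model = Sequential([
    Input(shape=(4, 5)),                 # 4 timesteps, 5 features per word
    SimpleRNN(3, activation='tanh'),     # (5×3) + (3×3) + 3 biases = 27 params
    Dense(1, activation='sigmoid'),      # (3×1) + 1 bias = 4 params
])

model.summary()   # Total trainable params: 31
```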

 

RNN FORWARD PROPAGATION

Let’s understand how forward propagation works inside an RNN.

We will continue with the same sentiment analysis example, where each word is already converted into a 5-dimensional vector (like x₁₁ = [1 0 0 0 0]).

RNNs process the input one timestep at a time, and the forward pass follows a concept called “unrolling/unfolding through time.”

This simply means that the same recurrent layer is reused repeatedly — like a loop — for each word in the sequence.

 

Step-by-Step Forward Propagation

At t = 1, the first word x₁₁ (shape 1×5) enters the recurrent layer.

Input-to-Hidden Transformation

We multiply x₁₁ with the input weight matrix Wᵢ (shape 5×3):

x₁₁Wᵢ → (1 × 5)(5 × 3) = 1 × 3

Each of the 3 hidden nodes then applies an activation function (commonly tanh, but ReLU can also be used):

h₁ = tanh(x₁₁Wᵢ + b₁)

This gives us the hidden output at the first timestep, O₁ (shape 1×3); here O₁ and h₁ refer to the same vector.

 

t = 2

Now we pass the second word x₁₂ along with the previous hidden output O₁.

The key difference is that RNNs also use recurrent (feedback) weights, Wₕ (shape 3×3).

So the calculation becomes:

h₂ = tanh(x₁₂Wᵢ + O₁Wₕ + b₁)

  • x₁₂ Wᵢ → (1×5)(5×3) = 1×3

  • O₁ Wₕ → (1×3)(3×3) = 1×3

  • Adding them + bias → 1×3

Thus, we get the new hidden output O₂.

 

t = 3

For the third word x₁₃ :

h₃ = tanh(x₁₃Wᵢ + O₂Wₕ + b₁)

This generates O₃ (shape 1×3), which now contains information from all previous steps.

 

Final Output Layer

Once we get the last hidden output (O₃), we pass it through the output weights Wₒ (3×1):

O₃Wₒ → (1 × 3)(3 × 1) = 1 × 1

Then apply the sigmoid activation (since this is a binary sentiment prediction):

ŷ = σ(O₃Wₒ + bₒ)

This gives us the final sentiment output (0 or 1).

 

What about t = 1?

Only the later timesteps (t ≥ 2) receive two inputs:

  • Current word xₜ

  • Previous output Oₜ₋₁

But for consistency, we initialize:

O₀ = 0 (a 1 × 3 zero vector)

So even at t = 1:

h₁ = tanh(x₁₁Wᵢ + O₀Wₕ + b₁)

This makes the computation uniform across timesteps.

 

Why is it called Recurrent?

 

Because the same hidden layer is reused for every timestep.

  • The inputs change word by word

  • The weights stay the same

  • The hidden state carries information from earlier steps

This allows RNNs to process sequential data and maintain memory over multiple timesteps (though simple RNNs retain only short-term memory).

 

SIMPLIFIED REPRESENTATION

You can visualize the RNN cell as a box that receives two inputs:

  1. The current word vector → xᵢₜ

    • (i = sentence index, t = timestep)

  2. The previous hidden output → Oₜ₋₁

Inside the RNN cell:

  1. Compute input contribution:

    xᵢₜWᵢ

  2. Compute recurrent contribution:

    Oₜ₋₁Wₕ

  3. Add them and apply activation:

    Oₜ = tanh(xᵢₜWᵢ + Oₜ₋₁Wₕ + b₁)

If this Oₜ is from the last timestep, it is sent to the output layer:

ŷ = σ(OₜWₒ + bₒ)

This final value becomes the model’s prediction.
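To tie everything together, here is a NumPy sketch of the full forward pass for "Movie was good" (the weights are random and purely illustrative, not trained):

```python
# Forward pass of a simple RNN over "movie was good" (one-hot inputs).
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 5, 3

Wi = rng.standard_normal((input_size, hidden_size))    # input -> hidden
Wh = rng.standard_normal((hidden_size, hidden_size))   # hidden -> hidden (recurrent)
Wo = rng.standard_normal((hidden_size, 1))             # hidden -> output
b1 = np.zeros((1, hidden_size))
bo = np.zeros((1, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# "movie was good" as one-hot rows (movie, was, good)
x = np.array([[1, 0, 0, 0, 0],
              [0, 1, 0, 0, 0],
              [0, 0, 1, 0, 0]], dtype=float)

O = np.zeros((1, hidden_size))                     # O₀ = 0
for t in range(x.shape[0]):                        # one timestep per word
    O = np.tanh(x[t:t + 1] @ Wi + O @ Wh + b1)     # Oₜ = tanh(xₜWᵢ + Oₜ₋₁Wₕ + b₁)

y_hat = sigmoid(O @ Wo + bo)                       # ŷ = σ(OₜWₒ + bₒ)
print(y_hat.shape)                                 # (1, 1): a single sentiment score
```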

