
From RNNs to GPT: The Epic History and Evolution of Large Language Models (LLMs)

  • Writer: Aryan
  • Feb 8
  • 11 min read

SEQUENCE TASKS AND THEIR TYPES



The seq2seq story begins with Recurrent Neural Networks (RNNs) and the concept of sequential data. RNNs are specifically designed to work with data where order matters. Examples of sequential data include natural-language text, time series, and bioinformatics data.

There are three major types of RNN-based sequence mappings.

The first type is many-to-one. In this case, the input is a sequence, but the output is not; it is a single value, such as a class label. A common example is sentiment analysis, where the input is a sequence of words (such as a movie review), and the model predicts a single label like positive or negative.

The second type is one-to-many. Here, the input is a single, non-sequential item, while the output is a sequence. A classic example is image captioning, where an image is given as input and a sequence of words describing the image is generated as output.

The third type is many-to-many. In this case, both input and output are sequences. This category can be further divided into two types.

The first is synchronous many-to-many, where the length of the input sequence is equal to the length of the output sequence. Examples include part-of-speech tagging and named entity recognition, where each input token corresponds to an output label.

The second is asynchronous many-to-many, where the input and output sequences are of different lengths. A common example is machine translation, such as language translation in tools like Google Translate. This asynchronous nature makes the problem more complex.

Sequence-to-sequence (seq2seq) architectures were introduced specifically to address this asynchronous many-to-many problem.
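To make these mapping types concrete, here is a minimal PyTorch sketch (not from the original discussion; the vocabulary size, dimensions, and variable names are illustrative assumptions) showing how the same LSTM supports a many-to-one and a synchronous many-to-many setup, depending on which of its outputs are used.

```python
import torch
import torch.nn as nn

# Toy dimensions, chosen here purely for illustration
vocab_size, embed_dim, hidden_dim, num_labels = 1000, 32, 64, 2

embedding = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

tokens = torch.randint(0, vocab_size, (1, 10))   # one sequence of 10 token ids
outputs, (h_n, c_n) = lstm(embedding(tokens))    # outputs: (1, 10, hidden_dim)

# Many-to-one (e.g. sentiment analysis): use only the final hidden state
sentiment_head = nn.Linear(hidden_dim, num_labels)
sentiment_logits = sentiment_head(h_n[-1])       # shape: (1, num_labels)

# Synchronous many-to-many (e.g. POS tagging): use the output at every time step
tagging_head = nn.Linear(hidden_dim, num_labels)
tag_logits = tagging_head(outputs)               # shape: (1, 10, num_labels)
```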


SEQ2SEQ TASKS

 

Seq2seq models are used in several important tasks.

The first is text summarization, where we provide a chunk of text as input and generate a shorter summary. The input text is a sequence, and the summary is also a sequence.

The second task is question answering, where a user asks a question and the model generates an appropriate answer. These models are typically trained using large knowledge bases.

The third task is chatbots or conversational AI. These are dialogue-based systems where both the input and the response are text sequences.

The fourth task is speech-to-text, which involves converting spoken language into written text, again mapping one sequence into another.

 

HISTORY OF SEQ2SEQ MODELS


We can divide the history of seq2seq models into five important stages.

The first solution proposed to solve seq2seq tasks was the encoder–decoder architecture. This was the earliest attempt to address sequence-to-sequence problems, but it had several limitations.

The second major improvement came with the attention mechanism.

The third stage introduced transformers.

The fourth stage focused on transfer learning.

And the final stage led to the development of large language models (LLMs).

We will discuss each of these stages one by one.

 

ENCODER–DECODER ARCHITECTURE

The encoder–decoder architecture dates back to 2014, originating from research at Google. Ilya Sutskever, who later co-founded OpenAI and is a very prominent figure in deep learning, together with Oriol Vinyals and Quoc V. Le, published a paper titled “Sequence to Sequence Learning with Neural Networks.” This paper became a seminal and highly influential work in the field.

In this paper, the authors pointed out that sequence-to-sequence problems were not being solved effectively with existing approaches. They proposed a new architecture, which they called the encoder–decoder network, to better handle such tasks.

In seq2seq problems, we are given an input sequence and we want to generate an output sequence, such as in machine translation. For example, if we input a sentence like “I love India”, the model should translate it into Hindi as “Mujhe Bharat se pyar hai.”

The proposed solution was simple and elegant. The architecture consists of two main components: an encoder and a decoder.

The encoder processes the input sequence word by word and compresses the information into a fixed-length representation, often called a context vector. This compressed representation is then passed to the decoder.

The decoder takes this compressed information and generates the output sequence step by step, producing one word at a time. Both the encoder and decoder are implemented using RNN-based cells, typically LSTM cells. Although GRUs can also be used, the original paper worked with LSTMs.

In LSTM-based encoding, one word is fed at each time step. The internal states, namely the hidden state and cell state, are updated continuously. As more words are processed, the network gradually summarizes the entire sentence. When the final input word is processed, the encoder produces a single compressed representation of the complete input sequence.

This compressed vector is then fed into the decoder, which also uses LSTM cells. The decoder generates the output sequence in a step-by-step manner, predicting one word after another. This is the basic working of the encoder–decoder architecture.
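A minimal PyTorch sketch of this idea might look as follows. It is an illustration under assumed toy dimensions, not the code from the original paper; the decoder here is run with teacher forcing, where the ground-truth previous word is fed in at each step during training.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder–decoder sketch; sizes and names are illustrative."""
    def __init__(self, src_vocab, tgt_vocab, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, embed_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encoder: the final (hidden, cell) states act as the fixed-length context vector
        _, (h, c) = self.encoder(self.src_embed(src_ids))
        # Decoder: initialized with that context, predicts the next word at each step
        dec_out, _ = self.decoder(self.tgt_embed(tgt_ids), (h, c))
        return self.out(dec_out)               # (batch, tgt_len, tgt_vocab) logits

model = Seq2Seq(src_vocab=5000, tgt_vocab=5000)
logits = model(torch.randint(0, 5000, (1, 7)), torch.randint(0, 5000, (1, 9)))
```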

The model performs reasonably well when the input sentences are short. However, when the input becomes longer, such as an entire paragraph, the output quality degrades significantly. The generated translation may lose meaning, sentiment, or overall context.

The main reason for this issue lies in the architecture itself. The encoder compresses the entire input sequence into a single context vector, and the entire burden of information is placed on this fixed-length representation. When the input sequence is too long, information loss occurs, similar to short-term memory limitations. As a result, decoding becomes inaccurate.

Another key limitation is that the decoder’s output heavily depends on the final time-step context vector produced by the encoder. This bottleneck leads to poor translation quality for long sequences.

To overcome these limitations, a new mechanism was introduced, known as the attention mechanism.

 

ATTENTION MECHANISM

In 2015, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio published the paper “Neural Machine Translation by Jointly Learning to Align and Translate.” This was the first work that formally introduced the attention mechanism.

The attention-based encoder–decoder model is still an encoder–decoder architecture, but with a crucial difference. The encoder remains the same; the key change lies in the decoder block.

In the traditional encoder–decoder architecture, the encoder produces a single context vector after processing the final input time step. This compressed vector is then passed to the decoder, which generates the output sequence step by step.

In contrast, the attention-based model does not rely on a single fixed context vector. Instead, the decoder has access to all internal states of the encoder. The hidden states hₜ and cell states cₜ produced at every time step of the encoder are available to the decoder throughout the decoding process. In the traditional approach, only the final compressed context vector is available, whereas in the attention-based model, the decoder has full access to the entire input sequence representation.

Because of this, when predicting a particular output word, the model has access to the full sentence context, rather than a single compressed summary. However, this introduces a new question: when all encoder states are available, which states are most useful for predicting the current word? This is where attention comes into play.

Attention is a mechanism that dynamically selects the most relevant encoder hidden states for a given decoder time step. For each decoder step, attention assigns weights to the encoder hidden states based on their relevance to the current prediction. These weighted states are then combined to form a context vector specific to that decoder step. The attention mechanism itself is a neural network whose role is to identify the most informative encoder states for predicting the current output word.

In this architecture, the first part is the encoder block and the second part is the decoder block. The input is fed into the encoder one token at a time, and at each time step, hidden states are generated and stored.

When the decoder generates the first output word, it takes all encoder hidden states and passes them through the attention layer. The attention layer determines which hidden states are most useful, converts this information into a context vector, and produces the output word.

For the second output word, the decoder again uses all encoder hidden states through the attention layer, computes a new context vector, and predicts the next word. This process continues for every decoder time step.

In the traditional encoder–decoder architecture, the same single context vector is used for all decoding steps. In contrast, the attention-based model computes a different context vector at every decoder time step. This ensures that important information from the input sequence is not lost, which is the main advantage of the attention mechanism.
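Below is a minimal sketch of this per-step context computation, written as an additive (Bahdanau-style) attention layer in PyTorch. The dimensions and variable names are assumptions for illustration, not the paper's implementation: each decoder step scores all encoder hidden states, turns the scores into weights with a softmax, and takes their weighted sum as that step's context vector.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_dim = 128                                  # illustrative size

# Learnable pieces of an additive (Bahdanau-style) attention layer
W_enc = nn.Linear(hidden_dim, hidden_dim, bias=False)
W_dec = nn.Linear(hidden_dim, hidden_dim, bias=False)
v = nn.Linear(hidden_dim, 1, bias=False)

encoder_states = torch.randn(1, 10, hidden_dim)   # all encoder hidden states (batch, src_len, hidden)
decoder_state = torch.randn(1, hidden_dim)        # decoder hidden state at the current step

# Score each encoder state against the current decoder state
scores = v(torch.tanh(W_enc(encoder_states) + W_dec(decoder_state).unsqueeze(1)))  # (1, 10, 1)
weights = F.softmax(scores, dim=1)                # attention weights sum to 1 over the source

# Per-step context vector: weighted sum of the encoder states
context = (weights * encoder_states).sum(dim=1)   # (1, hidden_dim)
```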

 

TRANSFORMERS

One drawback of the attention-based encoder–decoder architecture is its computational cost. For every decoder time step, a new context vector is computed, which increases training time and computation. Since all encoder hidden states are considered at each decoding step, the cost grows significantly.

 

If the input sequence has n words and the output sequence has m words, the model performs on the order of n×m attention computations, which grows roughly quadratically when the input and output lengths are similar. For long sentences, this becomes a major bottleneck.

Between 2015 and 2017, several variations of attention were proposed to address this issue. Eventually, researchers realized that the core limitation was not attention itself, but the sequential nature of LSTMs. LSTMs process tokens one step at a time, which prevents parallel computation.

To overcome this, researchers explored architectures that could remove this sequential dependency and enable parallel processing. This led to the introduction of transformers. In 2017, Google Brain published the paper “Attention Is All You Need,” which became highly influential.

 

The biggest change introduced by transformers is the complete removal of LSTMs and RNNs. Transformers rely entirely on attention mechanisms, specifically self-attention.

The transformer architecture still consists of an encoder and a decoder, but neither uses RNN cells. Instead, both blocks are built using self-attention layers, fully connected dense layers, normalization layers, and embedding layers.

The most important advantage of transformers is that they can process all tokens simultaneously, rather than one word at a time. This parallel processing removes the biggest bottleneck of earlier encoder–decoder models. As a result, training becomes significantly faster.
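As a rough illustration (a sketch, not the paper's implementation), single-head scaled dot-product self-attention can be written in a few lines of PyTorch; because it is expressed as batched matrix products, every token position is processed in parallel.

```python
import math
import torch

def self_attention(x, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over a whole sequence at once."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v           # project all tokens in parallel
    scores = Q @ K.transpose(-2, -1) / math.sqrt(K.size(-1))
    weights = torch.softmax(scores, dim=-1)       # each token attends to every token
    return weights @ V

d_model = 64                                      # illustrative size
x = torch.randn(1, 10, d_model)                   # (batch, seq_len, d_model)
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)
out = self_attention(x, W_q, W_k, W_v)            # (1, 10, d_model)
```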

Transformers introduced a groundbreaking architecture that changed the future of NLP. Models could now be trained in a fraction of the time and cost, with reduced hardware requirements. This architectural shift laid the foundation for modern large-scale language models.


TRANSFER LEARNING

Training a transformer model from scratch is very costly and time-consuming, for three main reasons.

First, it requires high-end GPUs, which significantly increases infrastructure cost.

Second, even though transformers train faster than RNN-based models, training still takes a substantial amount of time.

Third, data availability is a major challenge. For example, performing sentiment analysis with a transformer trained from scratch requires a large amount of labeled data, which most real-world projects simply do not have. With limited data, training a transformer from scratch does not produce good results.

Due to these constraints, most practitioners could not use transformers directly.

This is where transfer learning became important. In 2018, a landmark research paper by Jeremy Howard and Sebastian Ruder, titled “Universal Language Model Fine-tuning for Text Classification (ULMFiT)”, was published. This paper demonstrated that transfer learning can be effectively applied to the NLP domain. Before this work, transfer learning had been highly successful in computer vision, but it was widely believed that it could not be applied to NLP tasks. ULMFiT challenged this belief and provided a clear framework showing how transfer learning works in NLP.

Before going deeper, let us first understand what transfer learning is.

Transfer learning is a technique in which knowledge learned from one task is reused to improve performance on a related task. For example, in image classification, a model trained to recognize cars can reuse that knowledge to recognize trucks. In practice, we take a pretrained model, keep its learned weights, replace the final task-specific layers with new ones, and train only those new layers on the new data. This process is called fine-tuning.

A common example comes from computer vision. Models pretrained on ImageNet, which contains millions of images, learn general features such as edges, textures, and shapes. Later, these models can be fine-tuned on a smaller dataset for a specific task. Instead of retraining the entire network, we reuse most of the pretrained weights and retrain only a few layers. This is the core idea behind transfer learning.
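A minimal sketch of this reuse pattern, assuming a recent torchvision and a ResNet-18 pretrained on ImageNet, might look like the following; the two-class task is a placeholder example.

```python
import torch.nn as nn
from torchvision import models

# Load a network pretrained on ImageNet
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze the pretrained feature extractor so its weights are reused, not retrained
for param in model.parameters():
    param.requires_grad = False

# Replace the final task-specific layer with a new head for the smaller task
num_classes = 2                                   # e.g. cars vs. trucks (placeholder)
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Only the new head's parameters are updated during fine-tuning
trainable = [p for p in model.parameters() if p.requires_grad]
```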

 

Why transfer learning was difficult in NLP

 

Initially, transfer learning was not considered very useful in NLP due to two major reasons.

The first reason was task specificity. NLP tasks such as sentiment analysis, named entity recognition (NER), part-of-speech tagging, and machine translation were considered fundamentally different. Each task had unique requirements, and it was believed that a model trained on one task would not perform well on another.

The second reason was the lack of labeled data, especially for tasks like machine translation. For example, English-to-Hindi translation requires a large amount of aligned, labeled sentence pairs. Such datasets are expensive and difficult to collect. Because of these limitations, transfer learning did not initially establish itself in NLP.

This changed in 2018 with the introduction of ULMFiT. Instead of using machine translation for pretraining, the authors used a different task called language modeling.

Language modeling is an NLP task where the model learns to predict the next word in a sentence. Given a sequence of words, the model learns the probability distribution of the next word. This task turned out to be highly effective as a pretraining objective.

Language modeling works well for pretraining for two main reasons.

First, it enables rich feature learning. Even though next-word prediction seems simple, the model learns grammar, syntax, semantics, and even some level of world knowledge. For example, in a sentence like “The hotel was exceptionally clean, yet the service was ___”, the model can infer that a negative word such as “bad” or “pathetic” is likely to follow. This shows that the model understands context and sentiment.

Second, language modeling benefits from the huge availability of data. Unlike machine translation, language modeling does not require labeled data. Any text source—PDFs, books, articles, or web pages—can be used. This makes it a form of unsupervised pretraining. Because of these advantages, language modeling was chosen as the primary pretraining task in ULMFiT.
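A rough sketch of the next-word-prediction objective (illustrative PyTorch with toy sizes, not ULMFiT's actual code, which uses AWD-LSTM) shows why no labels are needed: the targets are simply the input tokens shifted by one position, so any raw text provides its own supervision.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 1000, 32, 64  # toy sizes for illustration

embedding = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
lm_head = nn.Linear(hidden_dim, vocab_size)

tokens = torch.randint(0, vocab_size, (1, 12))    # a chunk of raw text as token ids
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # targets = inputs shifted by one word

hidden, _ = lstm(embedding(inputs))
logits = lm_head(hidden)                          # predicted distribution over the next word

# Standard language-modeling loss: cross-entropy against the actual next word
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
```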

 

ULMFiT setup and results

 

In their setup, the authors used a variant of LSTM called AWD-LSTM. They trained this model on large amounts of Wikipedia text using unsupervised language modeling, where the objective was to predict the next word.

After pretraining, they replaced the output layer with a classifier and fine-tuned the model on different datasets such as IMDb movie reviews and Yelp reviews. After fine-tuning, the model achieved state-of-the-art performance.

The results were revolutionary. A fine-tuned model trained on only a few hundred samples outperformed models trained from scratch on thousands of samples. This clearly demonstrated the power of transfer learning in NLP.

At that time, the transformer architecture had just been introduced, and these developments were happening in parallel. By 2018, NLP had powerful architectures like transformers and strong training strategies based on transfer learning.

 

LLMs

 

In 2018, around ten months after the ULMFiT paper, two major transfer-learning-based language models were introduced: BERT by Google and GPT by OpenAI. Both models were pretrained as language models and showed excellent transfer learning capabilities. With fine-tuning, they could be adapted to a wide range of NLP tasks.

The key difference between the two was architectural. BERT is an encoder-only model, while GPT is a decoder-only model. With the release of these pretrained models, researchers and practitioners could download them and fine-tune them on smaller datasets, achieving very strong results. This marked a major transformation in the NLP field.
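As an illustration of how this works in practice today (using the Hugging Face transformers library, a modern convenience rather than part of the original 2018 releases), downloading a pretrained BERT and attaching a fresh classification head takes only a few lines; the checkpoint name and label count below are placeholder choices.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Download a pretrained encoder-only model and add a new classification head
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# The pretrained weights are reused; only fine-tuning on a small labeled dataset remains
batch = tokenizer(["The movie was surprisingly good."], return_tensors="pt")
outputs = model(**batch)
print(outputs.logits.shape)                       # torch.Size([1, 2]): one score per class
```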

As newer versions of GPT were developed, the models became increasingly large, with billions of parameters trained on massive datasets. Because of their scale, these models came to be known as Large Language Models (LLMs).

There are several defining characteristics of LLMs.

First is data scale. LLMs are trained on extremely large datasets containing billions of words. For example, GPT-3 was trained on a corpus drawn from approximately 45 terabytes of raw text data, sourced from books, websites, and other internet content. Data diversity is crucial to reduce bias.

Second is training hardware. Training LLMs requires large GPU clusters, often consisting of thousands of high-end GPUs, along with massive memory and high-speed interconnects. This demands significant hardware investment.

Third is training time. Training on tens of terabytes of data takes days or even weeks, even with powerful hardware.

Fourth is cost. Training LLMs involves high expenses related to hardware, electricity, infrastructure, and highly skilled engineers. In practice, training such models costs millions of dollars, which makes it feasible mainly for large companies, governments, or major research institutions.

Fifth is energy consumption. Training LLMs consumes enormous amounts of energy. For example, training a GPT-3–scale model with 175 billion parameters can consume electricity comparable to that used by a small city over a month.

Because of this massive scale in data, compute, cost, and energy, these models are called Large Language Models.

