
The Evolution of Object Detection: Fast R-CNN and Faster R-CNN Explained

  • Writer: Aryan
  • Feb 27
  • 11 min read

Fast R-CNN


Disadvantages of R-CNN

 

While R-CNN was a breakthrough, it had significant flaws that made it impractical for real-world use.

  1. Training is a Multi-Stage Pipeline:

    It wasn't a single model; it was a complex chain of three different models trained separately:

    • Stage 1: Fine-tune a CNN on region proposals (Log Loss).

    • Stage 2: Train SVMs to classify the features (Hinge Loss).

    • Stage 3: Train Bounding Box Regressors (L2 Loss).

      This made the pipeline difficult to manage and optimize.

  2. Training is Expensive (Space & Time):

    To train the SVMs and Regressors, we had to extract features from every single region proposal (2,000 per image) and save them to disk.

    • For deep networks like VGG16, this required hundreds of gigabytes of storage.

    • It was extremely slow (taking ~2.5 GPU-days just for 5k images).

  3. Inference is Extremely Slow:

    At test time, the model had to run the full CNN forward pass 2,000 times for a single image (once for each proposal).

    • Latency: It took ~47 seconds to process one image on a GPU. This is far too slow for real-time applications (like self-driving cars).

 

The Solution: "Look Once"

Scientists realized the bottleneck was the repetitive CNN processing. In R-CNN, if two region proposals overlapped, we re-calculated the CNN features for that overlapping area twice. This is redundant.

The Fast R-CNN Idea:

Instead of cropping the image 2,000 times and passing each crop to the CNN, why not pass the whole image to the CNN just once?

  1. Single Forward Pass: We pass the entire raw image through a pre-trained "Backbone CNN" (like VGG16) only one time.

  2. Feature Map Generation: The CNN outputs a Feature Map (a compressed, high-level representation of the image).

  3. Projection: We still run Selective Search on the original image to find Region Proposals, but we project these coordinates onto the Feature Map.

    • Example: If a proposal is at (x, y) on the image, we find the corresponding coordinates (x', y') on the small feature map.

This means we calculate the expensive convolutional features only once per image, saving massive amounts of computation.
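As a sketch, projecting a proposal onto the feature map just divides its coordinates by the backbone's total downsampling stride (16 for VGG16). The helper name here is illustrative:

```python
# Sketch: map a Selective Search proposal (x1, y1, x2, y2) in image
# coordinates onto the feature map, assuming the backbone downsamples
# by a fixed total stride (16 for VGG16).

def project_to_feature_map(box, stride=16):
    x1, y1, x2, y2 = box
    # Floor-divide each coordinate by the stride.
    return (x1 // stride, y1 // stride, x2 // stride, y2 // stride)

# A proposal spanning (64, 32) to (384, 352) on the image:
print(project_to_feature_map((64, 32, 384, 352)))  # -> (4, 2, 24, 22)
```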

 

The New Problem: Variable Sizes

We have now extracted distinct regions from the Feature Map corresponding to our objects.

  • The Issue: The Fully Connected (FC) layers at the end of the network require a fixed input size (e.g., a vector of length 4096).

  • The Reality: Our region proposals are all different shapes and sizes (some are tall, some wide, some small). When we crop these from the feature map, the resulting feature tensors are also of different sizes.

In standard CNNs, we used Flatten(), but you cannot flatten tensors of different sizes into the same fixed vector. We need a way to standardize them.

 

Region of Interest (RoI) Pooling

To solve the variable size problem, Fast R-CNN introduced a special layer called RoI Pooling.

What it does:

RoI Pooling accepts feature map regions of any size and converts them into a fixed spatial extent (e.g., 7 x 7).

How it works:

  1. Input: A feature map region of size h x w (variable).

  2. Target: A fixed output size H x W (e.g., 7 x 7).

  3. Grid Division: The layer divides the region of interest into a grid of H x W sub-windows.

    • The size of each sub-window is approximately h/H x w/W.

  4. Max Pooling: It performs Max Pooling inside each sub-window, taking the largest value.

  5. Output: Regardless of whether the input was 20 x 20 or 100 x 50, the output is always a fixed 7 x 7 feature map.

This fixed output can now be flattened and passed smoothly into the Fully Connected layers for classification and bounding box regression.
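The five steps above fit in a few lines of NumPy. A minimal sketch (it assumes the region is at least H x W on each side, so no sub-window is empty):

```python
import numpy as np

def roi_pool(region, out_h=7, out_w=7):
    """Max-pool an h x w feature region down to a fixed out_h x out_w."""
    h, w = region.shape
    # Integer sub-window boundaries; each window is roughly h/H x w/W.
    ys = np.linspace(0, h, out_h + 1).astype(int)
    xs = np.linspace(0, w, out_w + 1).astype(int)
    out = np.empty((out_h, out_w), dtype=region.dtype)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return out

# Different input sizes, same fixed output size:
print(roi_pool(np.random.rand(20, 20)).shape)   # (7, 7)
print(roi_pool(np.random.rand(100, 50)).shape)  # (7, 7)
```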

 

Fast R-CNN Improvements

  • Input to CNN: R-CNN feeds ~2,000 cropped images per scene; Fast R-CNN feeds 1 whole image per scene.

  • Feature Extraction: R-CNN repeats it 2,000 times; Fast R-CNN performs it once (shared computation).

  • Output Layer: R-CNN uses SVMs (an external model); Fast R-CNN uses Softmax (built into the network).

  • Box Regression: R-CNN uses an external linear regressor; Fast R-CNN builds it into the network (multi-task loss).

  • Speed: R-CNN takes ~47 seconds/image; Fast R-CNN takes ~2 seconds/image (approx. 25x faster).

Fast R-CNN essentially integrated the feature extraction, classification, and regression into a single, end-to-end trainable network, making it faster and more accurate. However, it still relied on the slow "Selective Search" algorithm for finding proposals, which leads us to the next evolution: Faster R-CNN.

 

How RoI Pooling Actually Works

We know that Region of Interest (RoI) Pooling converts variable-sized regions into a fixed-size output (e.g., 2 x 2 or 7 x 7). But how exactly does the math work, and where do errors creep in?

 

1. The Mapping Problem (Stride & Quantization)

Imagine our input image is 400 x 400, and after passing through the CNN (which downsamples the image), we get a Feature Map of size 10 x 10.

  • Stride (Scale Factor): The ratio is 400 / 10 = 40. This means every 1 pixel on the feature map represents 40 pixels on the original image.

Scenario A (Perfect Alignment):

If we have a bounding box of size 200 x 200 on the image:

  • On the feature map, it becomes 200 / 40 = 5 x 5.

  • This is a clean integer. No problem.

Scenario B (The "Quantization" Issue):

Real-world bounding boxes aren't always multiples of 40. Suppose we have a box of size 210 x 210:

  • Calculation: 210 / 40 = 5.25.

  • The Problem: We cannot select "5.25 pixels." A pixel is atomic.

  • The Solution: We are forced to round off (quantize) the coordinates (e.g., round 5.25 down to 5).

  • The Consequence: This rounding introduces a misalignment between the feature map box and the actual object. While Fast R-CNN handles this using standard rounding, this "misalignment" becomes a major issue for pixel-perfect tasks (like segmentation), which is solved later in Mask R-CNN using "RoI Align."


2. The Pooling Operation

Once we have our quantized region on the feature map (say, 5 x 5), and we need a fixed output of 2 x 2:

  1. Grid Division: We divide the 5 x 5 region into a 2 x 2 grid of sub-windows. Since 5 isn't divisible by 2, the sub-windows are uneven (e.g., 2 x 2, 2 x 3, and 3 x 3).

  2. Max Pooling: We take the maximum value from each sub-window.

  3. Result: We get a clean, fixed 2 x 2 output that can be flattened and sent to the Fully Connected layers.
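Both steps can be traced numerically. A sketch using the numbers from the text (stride 40, a 210-pixel box, 2 x 2 output), with a toy 5 x 5 region filled with the values 0..24:

```python
import numpy as np

stride = 40
box_size = 210
quantized = box_size // stride  # 210 / 40 = 5.25 -> rounded down to 5

# Toy 5 x 5 feature region, pooled to 2 x 2 with uneven sub-windows.
region = np.arange(25, dtype=float).reshape(5, 5)
edges = np.linspace(0, 5, 3).astype(int)  # [0, 2, 5]: windows of size 2 and 3
pooled = np.array([[region[edges[i]:edges[i + 1], edges[j]:edges[j + 1]].max()
                    for j in range(2)] for i in range(2)])
print(pooled)
# [[ 6.  9.]
#  [21. 24.]]
```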


Handling Scale Invariance

 

One big question in object detection is: How do we detect both tiny cars and massive trucks?

 

The Multi-Scale Approach (Image Pyramids)

Historically, to handle different object sizes (scale invariance), researchers used Image Pyramids.

  • Method: Take the original image (400 x 400) and resize it multiple times (e.g., 300 x 300, 200 x 200, 100 x 100).

  • Process: Pass all these resized versions through the CNN separately.

  • Logic: The network will "see" the object at different scales, making it easier to detect.

The Fast R-CNN Verdict

The creators of Fast R-CNN experimented with this.

  • Cost: Processing 5 different scales takes 5 times longer.

  • Benefit: They found that deep CNNs are actually quite good at learning scale naturally. The accuracy gain from using image pyramids was marginal compared to the massive increase in computation time.

Conclusion:

Fast R-CNN discarded the multi-scale pyramid in favor of Single-Scale Training. We simply resize the image once (usually so the shortest side is 600 pixels) and pass it through. This offers the best trade-off: it is drastically faster with almost no loss in accuracy.
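As a sketch, single-scale preprocessing just rescales both dimensions by the same factor so the shortest side lands on 600 px:

```python
def single_scale_size(h, w, shortest=600):
    """Target (h, w) after resizing so the shortest side is `shortest`."""
    scale = shortest / min(h, w)
    return round(h * scale), round(w * scale)

print(single_scale_size(480, 640))  # (600, 800)
```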

 

Fast R-CNN

  1. Input: Image + Region Proposals (Selective Search).

  2. Backbone: Pass image once to get Feature Map.

  3. Projection: Map proposals to Feature Map.

  4. RoI Pooling: Convert mapped regions to fixed size (handling quantization).

  5. Heads:

    • Softmax: Classifies object (replacing SVM).

    • Regressor: Refines box coordinates.

  6. Result: A fast, end-to-end trainable model that fixes the speed bottlenecks of the original R-CNN.


Faster R-CNN

 

Need for Faster R-CNN

Before Faster R-CNN, region proposals were generated using Selective Search. This method typically produced around 2000 region proposals per image by grouping pixels based on low-level similarity cues such as texture, color, and segmentation patterns.

However, Selective Search had two major limitations:

  • It was not learnable — it relied on hand-crafted rules rather than learning from data.

  • It was a computational bottleneck — it ran separately from the CNN (often on CPU), making the overall pipeline slow.

Because proposal generation was external to the neural network, the detection system could not be trained end-to-end.

Faster R-CNN solves this problem by introducing a Region Proposal Network (RPN) — a learnable module that generates region proposals directly from convolutional feature maps. This removes the external bottleneck and allows proposal generation and detection to be trained jointly in a unified framework.

 

Architecture of Faster R-CNN

The Faster R-CNN pipeline can be understood as a unified two-stage architecture:

Image → Backbone CNN → Feature Map → RPN → Region Proposals → RoI Pooling → Fully Connected Layers → Two Branches (Classification + Bounding Box Regression)

 

Step 1: Backbone Network

The input image is first passed through a backbone CNN such as VGG16 or ResNet.

This network extracts a deep convolutional feature map that encodes high-level visual information.

 

Step 2: Region Proposal Network (RPN)

Instead of using Selective Search, Faster R-CNN introduces the Region Proposal Network on top of the shared feature map.

The RPN:

  • Takes the backbone feature map as input

  • Generates candidate object regions (region proposals)

  • Uses learnable convolutional layers

  • Replaces the fixed, non-learnable Selective Search algorithm

This is the key structural change from Fast R-CNN.

 

Step 3: RoI Pooling

The region proposals generated by the RPN are passed to the RoI Pooling layer along with the shared feature map.

RoI Pooling:

  • Converts variable-sized region proposals into fixed-size feature maps (e.g., 7 × 7)

  • Enables the network to process proposals through fully connected layers

 

Step 4: Final Detection Head

After RoI Pooling, the features are passed through fully connected (FC) layers and split into two branches:

  1. Classification Branch – predicts the object class.

  2. Bounding Box Regression Branch – refines the box coordinates.

 

Shared Computation: The Core Improvement

In earlier R-CNN variants, the CNN was applied multiple times (once per proposal). This made the system extremely slow.

In Faster R-CNN:

  • The backbone CNN is computed only once.

  • The feature map is shared between the RPN and the final detection head.

This shared architecture significantly reduces computation and makes the system efficient.

 

Training and Loss

Loss is computed at two places:

  1. RPN Loss

    • Objectness loss (foreground vs background)

    • Bounding box regression loss (anchor refinement)

  2. Detection Head Loss

    • Multi-class classification loss

    • Bounding box regression loss

These losses are combined into a multi-task loss, and gradients are backpropagated through the entire shared network.

This means:

  • The RPN branch updates the backbone.

  • The detection branch also updates the backbone.

  • The whole architecture is trained jointly in an end-to-end manner.
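Concretely, both regression terms use the Smooth L1 (Huber) loss in the Fast/Faster R-CNN papers, and all four terms are summed into one scalar before backpropagation. A minimal sketch (the equal weighting here is illustrative):

```python
def smooth_l1(x):
    """Smooth L1 (Huber) loss, used for both box-regression terms."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def multi_task_loss(rpn_cls, rpn_reg, det_cls, det_reg, lam=1.0):
    # One scalar; backprop from here updates the RPN, the detection
    # head, and the shared backbone together.
    return rpn_cls + lam * rpn_reg + det_cls + lam * det_reg

print(smooth_l1(0.5))  # 0.125  (quadratic region)
print(smooth_l1(2.0))  # 1.5    (linear region)
```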

 

Why RPN is Better Than Selective Search

Selective Search:

  • Produces ~2000 proposals per image

  • Based purely on similarity heuristics (texture, color, segmentation)

  • No learning phase

  • Runs separately from the CNN

Region Proposal Network:

  • Fully learnable

  • Uses convolutional layers

  • Generates proposals directly from feature maps

  • Shares weights with the backbone

  • Runs efficiently on GPU

Instead of relying on similarity-based grouping, the RPN learns to predict where objects are likely to appear. This makes Faster R-CNN both faster and more accurate.

 

Faster R-CNN is not just a speed improvement — it is a structural redesign.

By replacing Selective Search with a learnable Region Proposal Network and introducing shared convolutional computation, Faster R-CNN becomes a unified, end-to-end trainable two-stage detector.

This design is what made it a foundational architecture in modern object detection.

 

Anchor Boxes and RPN Internals


Sliding Window Intuition on the Feature Map

Assume the backbone produces a feature map of size 60 × 40.

To generate region proposals, the RPN applies a 3 × 3 convolution over this feature map.

At each spatial location:

  • A 3 × 3 window captures local context.

  • This window is mapped to a 1 × 1 × C feature vector (where C is the number of channels).

  • This 1 × 1 unit represents information from a specific receptive field in the original image.

The intuition is simple:

Each spatial location on the feature map may correspond to an object centered around that region in the original image.

 

Why Anchor Boxes Are Needed

A single fixed bounding box at each location is not enough. Objects can vary significantly in:

  • Size (small vs large)

  • Shape (square, tall, wide)

To handle this variation, we introduce Anchor Boxes.

At every spatial location (each 1 × 1 unit), we generate multiple reference boxes:

  • 3 different scales (small, medium, large)

  • 3 different aspect ratios (1:1 square, 1:2 tall, 2:1 wide)

So total anchors per location:

3 scales × 3 aspect ratios = 9 anchor boxes

These boxes share the same center point — which acts as the anchor — but differ in shape and size.

This increases the probability that at least one anchor overlaps well with a ground-truth object.
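A sketch of generating the 9 anchors at one location. The base size, scales, and exact ratio parameterization below are illustrative (the paper uses areas of 128², 256², 512² pixels), but the structure (3 scales x 3 ratios, shared center) is the same:

```python
def anchors_at(cx, cy, base=128, scales=(1, 2, 4), ratios=(1.0, 0.5, 2.0)):
    """All scale/ratio combinations of boxes centered at (cx, cy)."""
    boxes = []
    for s in scales:
        for r in ratios:            # r = width / height; area is preserved
            w = base * s * r ** 0.5
            h = base * s / r ** 0.5
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes

boxes = anchors_at(300, 300)
print(len(boxes))  # 9 anchors, all sharing the same center
```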

 

Total Number of Anchors

If the feature map size is:

60 × 40

And we generate 9 anchors per location:

60 × 40 × 9 = 21,600 anchors

So effectively, for every spatial cell in the feature map, we generate 9 candidate bounding boxes.

 

Removing Invalid Anchors

Now assume the original image size is 1000 × 600.

When these 21,600 anchors are projected back onto the original image:

  • Some anchors will extend beyond image boundaries.

  • These cannot be used.

So in the first filtering step:

  • We remove anchors that fall outside the image.

  • The count typically reduces from ~21,000 to around 6,000 valid anchors.

This is a major reduction in one step.
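The boundary filter is a simple vectorized mask. A sketch assuming (x1, y1, x2, y2) anchors and the 1000 x 600 image from the text:

```python
import numpy as np

def keep_inside(anchors, img_w=1000, img_h=600):
    """Keep only anchors that lie fully inside the image."""
    inside = ((anchors[:, 0] >= 0) & (anchors[:, 1] >= 0) &
              (anchors[:, 2] <= img_w) & (anchors[:, 3] <= img_h))
    return anchors[inside]

anchors = np.array([[10, 10, 200, 150],      # fully inside -> kept
                    [-20, 5, 100, 100],      # crosses the left edge -> dropped
                    [900, 500, 1100, 700]])  # crosses bottom-right -> dropped
print(len(keep_inside(anchors)))  # 1
```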

 

Inside the RPN: Two Output Branches

At each spatial location, after the 3 × 3 convolution, the RPN predicts outputs for all k anchors (usually k = 9).

For each location:


1. Classification Branch

Each anchor performs binary classification:

  • Foreground (object)

  • Background (no object)

For k anchors:

  • We need 2k output channels

  • For k = 9 → 18 outputs per location

These are objectness scores.

 

2. Regression Branch

Each anchor also predicts bounding box refinements.

For each anchor:

  • 4 coordinate offsets (Δx, Δy, Δw, Δh)

For k anchors:

  • We need 4k output channels

  • For k = 9 → 36 outputs per location

These offsets adjust the anchor box to better match the ground truth.
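The two branches are just 1 x 1 convolutions on top of the shared 3 x 3 convolution. A PyTorch sketch (the 512-channel sizes assume a VGG16-style feature map):

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels=512, k=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, 3, padding=1)  # 3x3 sliding window
        self.cls = nn.Conv2d(512, 2 * k, 1)  # objectness: 2 scores per anchor
        self.reg = nn.Conv2d(512, 4 * k, 1)  # offsets: (dx, dy, dw, dh) per anchor

    def forward(self, x):
        h = torch.relu(self.conv(x))
        return self.cls(h), self.reg(h)

scores, deltas = RPNHead()(torch.randn(1, 512, 60, 40))
print(scores.shape, deltas.shape)  # (1, 18, 60, 40) and (1, 36, 60, 40)
```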

 

Why Is This Still Two-Stage?

 

A common question:

If the RPN already predicts objectness and bounding boxes, why do we still need RoI Pooling and the second stage?

Because the RPN is class-agnostic.

It only answers:

“Is there an object here?”

It does not classify whether it is a car, dog, person, etc.

The RPN outputs are treated as proposals, which are passed to:

RoI Pooling → Fully Connected Layers → Final Classification + Bounding Box Refinement

This is why Faster R-CNN remains a two-stage detector.

 

Anchor Labeling Using IoU

After filtering, we have around 6,000 valid anchors.

During training, we label them using Intersection over Union (IoU) with ground-truth boxes.

Labeling rules:

  • IoU ≥ 0.7 → Positive (Foreground)

  • IoU ≤ 0.3 → Negative (Background)

  • 0.3 < IoU < 0.7 → Ignored

After labeling:

  • We may have around 4,000 labeled anchors (positives + negatives).

Instead of training on all of them:

  • We sample a mini-batch of 256 anchors per image

  • Maintain a balance between positive and negative samples

These sampled anchors are used to train the RPN.
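The labeling rule is easy to make concrete. A sketch with (x1, y1, x2, y2) boxes, returning 1 / 0 / -1 for positive / negative / ignored:

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def label_anchor(anchor, gt, hi=0.7, lo=0.3):
    v = iou(anchor, gt)
    return 1 if v >= hi else (0 if v <= lo else -1)  # -1 = ignored

gt = (0, 0, 100, 100)
print(label_anchor((0, 0, 100, 100), gt))      # 1  (IoU = 1.0, positive)
print(label_anchor((0, 0, 100, 50), gt))       # -1 (IoU = 0.5, ignored)
print(label_anchor((200, 200, 300, 300), gt))  # 0  (IoU = 0.0, negative)
```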

 

Evolution: R-CNN → Fast R-CNN → Faster R-CNN

R-CNN

Image → Selective Search → Extract region crops → CNN → SVM + Regressor

  • Selective Search is not learnable

  • CNN runs thousands of times

  • Extremely slow

 

Fast R-CNN

Image → CNN (once) → Selective Search → RoI Pooling → FC Layers

  • CNN computation is shared

  • Still relies on slow, external Selective Search

 

Faster R-CNN

Image → Shared CNN → RPN → RoI Pooling → Detection Head

  • Replaces Selective Search with RPN

  • Proposal generation becomes learnable

  • Shares convolutional features

  • Runs efficiently on GPU

  • Fully end-to-end trainable

 

Faster R-CNN is a structural improvement, not just a speed tweak.

By:

  • Introducing anchor boxes

  • Learning region proposals via RPN

  • Sharing convolutional computation

  • Training everything jointly

It transforms object detection into a unified, optimized two-stage architecture.

That design decision is what made Faster R-CNN a cornerstone model in modern object detection.

 

Performance Comparison: R-CNN vs Fast R-CNN vs Faster R-CNN

To understand the real impact of Faster R-CNN, it helps to compare the numbers. The main bottleneck in earlier versions was Selective Search, which was CPU-based and non-learnable. Faster R-CNN eliminates this bottleneck by introducing the GPU-based, learnable Region Proposal Network (RPN).

 

Proposal Generation Method

  • R-CNN: Selective Search

  • Fast R-CNN: Selective Search

  • Faster R-CNN: Region Proposal Network (RPN)

 

Proposal Generation Time

  • R-CNN: ~1500 ms

  • Fast R-CNN: ~1500 ms (Selective Search remains the bottleneck)

  • Faster R-CNN: ~10 ms

The dramatic improvement comes from replacing the external CPU-based algorithm with a learnable network operating on shared feature maps.

 

Prediction Time per Image

  • R-CNN: ~47,000 ms

  • Fast R-CNN: ~320 ms

  • Faster R-CNN: ~190 ms

R-CNN is extremely slow because it runs a CNN separately on approximately 2000 region proposals per image.

Fast R-CNN improves speed by computing the CNN only once per image.

Faster R-CNN further reduces time by integrating proposal generation into the network.

 

Speedup Factor (Relative to R-CNN)

  • R-CNN: 1× (baseline)

  • Fast R-CNN: ~25× faster

  • Faster R-CNN: ~250× faster

 

The transition from Fast R-CNN to Faster R-CNN was not just a performance tweak — it was a structural change. By removing the final non-learnable component (Selective Search) and replacing it with a trainable RPN that shares convolutional features, Faster R-CNN achieves near real-time detection without sacrificing accuracy.

All three architectures are two-stage detectors. The major evolution lies in replacing hand-crafted region proposals with a fully learnable, end-to-end proposal mechanism integrated directly into the network.

