R-CNN Explained: A Comprehensive Guide to Object Detection Architecture
- Aryan

- Feb 24
What is Object Detection?

Imagine we have an image containing a car. When we pass this image to a standard model, we usually get a simple label saying "Car." However, in Object Detection, we need more than just a label. We expect the model to output a bounding box around the object along with its class name and a confidence score.
For example, if the model detects a tree, it shouldn't just say "Tree." It should draw a box around the tree and tell us, "I am 90% confident this is a tree."
Object detection is a core task in Computer Vision where we do two things simultaneously:
Classify what objects are in the image.
Localize exactly where those objects are located.
Classification vs. Localization vs. Detection
To understand R-CNN, we must distinguish between these three concepts:
Image Classification:
This is what standard CNNs (like VGG16, AlexNet, ResNet) do.
Input: An image.
Output: A single label describing the main content (e.g., "Dog," "Cat," "House").
Limitation: It tells us what is in the image, but not where it is.
Object Localization:
This involves classification plus a single bounding box.
Scenario: The image contains only one object.
Output: The class label + coordinates of the single object.
Object Detection:
This is the most complex task.
Scenario: The image may contain multiple objects of different classes (e.g., a person walking a dog near a car).
Output: We must detect all objects, classify them, and draw bounding boxes around every single one, regardless of their shape, size, or position.
The Challenge: Why Not Just Use Normal CNNs?
A standard CNN (like the ones used for classification) gives us probability scores for classes present in the image. However, it fails to provide two critical pieces of information needed for detection:
Position: Where exactly is the object?
Count: How many objects are there?
Our objective with R-CNN is to modify standard architectures so they can predict both the class label and the position (bounding box) for every object in the scene.
Bounding Boxes and Representation

What is a Bounding Box?
Assume we have an image containing a person running. To build an object detection model, we need to define exactly where that person is. We do this using a Bounding Box.
A bounding box is essentially a rectangle that tightly encloses the object of interest. While objects in the real world have arbitrary shapes, in computer vision, we generally use rectangles because they are computationally efficient and easy to work with.
Specifically, we focus on Axis-Aligned Bounding Boxes. This means the edges of the box are always parallel to the vertical (y) and horizontal (x) axes of the image; they are never rotated.
How Do We Represent a Bounding Box?
There are two primary ways to mathematically represent these boxes. Depending on the dataset or architecture (like COCO or YOLO), the format changes, but we can easily convert between them.
1. The Corner Format (x₁, y₁, x₂, y₂)
In this format, we define the box using two specific points:
(x₁, y₁): The coordinates of the Top-Left corner.
(x₂, y₂): The coordinates of the Bottom-Right corner.
2. The Center Format (cₓ, c_y, w, h)
In this format, we define the box using its center and its size:
(cₓ, c_y): The coordinates of the Center of the bounding box.
w: The Width of the box.
h: The Height of the box.
Calculating Coordinates (The Math)
It is crucial to understand how to convert between these two formats, as different models require different inputs.
Scenario A: Converting Center to Corners
If you have the center (cₓ, c_y) and the dimensions (w, h), how do you find the corners?
Since the center is exactly in the middle:
The left edge is the center minus half the width; the top edge is the center minus half the height. The right and bottom edges add the halves instead:
x₁ = cₓ − w/2, y₁ = c_y − h/2
x₂ = cₓ + w/2, y₂ = c_y + h/2

Scenario B: Converting Corners to Center
If you have the corners (x₁, y₁) and (x₂, y₂), how do you find the center and size?
Width is the difference between the right and left x-coordinates.
Height is the difference between the bottom and top y-coordinates.
The center is the average of the corner coordinates.
w = x₂ − x₁
h = y₂ − y₁
cₓ = (x₁ + x₂)/2
c_y = (y₁ + y₂)/2
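Both conversions can be written directly from these formulas. A minimal sketch (the function names are my own, not from any particular library):

```python
def center_to_corners(cx, cy, w, h):
    """Convert center format (cx, cy, w, h) to corner format (x1, y1, x2, y2)."""
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def corners_to_center(x1, y1, x2, y2):
    """Convert corner format (x1, y1, x2, y2) to center format (cx, cy, w, h)."""
    w, h = x2 - x1, y2 - y1
    return (x1 + w / 2, y1 + h / 2, w, h)
```

Applying one function and then the other should always return the original box, which is a handy sanity check when wiring up a data pipeline.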

Standard Data Formats: COCO vs. YOLO

Different datasets use different standards to list these numbers. Two of the most common formats you will encounter are COCO and YOLO.
1. COCO Format
The Common Objects in Context (COCO) dataset uses a "Top-Left + Dimensions" format.
Structure: [x_min, y_min, width, height]
It gives you the top-left corner coordinates and the total size of the box. These are usually absolute pixel values.
2. YOLO Format
The YOLO (You Only Look Once) architecture uses a "Normalized Center" format.
Structure: [x_center, y_center, width, height]
Crucial Difference: In YOLO, these values are normalized between 0 and 1 relative to the total image size.
x_center = absolute x_center / image width
y_center = absolute y_center / image height
w = absolute width / image width
h = absolute height / image height
The goal is always the same: create a box that covers the full content of the object as tightly as possible. Whether we use corners or centers, the math allows us to switch back and forth depending on what our model needs.
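As an illustration, converting one format to the other takes only a few lines. This is a sketch under the definitions above (the function name `coco_to_yolo` is my own):

```python
def coco_to_yolo(box, img_w, img_h):
    """Convert COCO [x_min, y_min, width, height] in absolute pixels
    to YOLO [x_center, y_center, width, height] normalized to 0..1."""
    x_min, y_min, w, h = box
    return [(x_min + w / 2) / img_w,   # normalized center x
            (y_min + h / 2) / img_h,   # normalized center y
            w / img_w,                 # normalized width
            h / img_h]                 # normalized height
```

For example, a COCO box `[40, 45, 20, 10]` in a 100 x 100 image becomes the YOLO box `[0.5, 0.5, 0.2, 0.1]`.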
The Precursor to R-CNN: Single-Object Localization
Before diving into complex object detection (finding multiple objects), let's look at how we modify a standard CNN to handle Single-Object Localization.
The Architecture Setup

Assume we have a standard RGB image (e.g., size 224 x 224) that contains exactly one object (e.g., a car).
A standard CNN (like VGG or ResNet) ends with a Flatten layer followed by a Dense layer for classification. To enable localization (finding the bounding box), we modify the architecture after the Flatten layer by splitting it into two separate branches (heads):
The Classification Head:
Goal: Predict the class label (e.g., Car, Dog, Background).
Structure: A fully connected (Dense) layer matching the number of classes.
Loss Function: Cross-Entropy Loss (standard for classification).
The Regression Head (Bounding Box):
Goal: Predict the coordinates of the object.
Structure: A fully connected (Dense) layer with exactly 4 neurons.
Output: The values (x, y, w, h) corresponding to the bounding box.
Loss Function: Mean Squared Error (MSE) or L2 Loss (standard for regression).
Multi-Task Loss
Since the network is doing two things at once, we need a combined loss function to train it. We calculate the total loss as a weighted sum of both individual losses:
Total Loss = Classification Loss + α · Regression Loss
During backpropagation, the model updates its weights to minimize both errors simultaneously. It learns to recognize the object and draw a box around it at the same time.
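To make the weighted sum concrete, here is a toy, framework-free sketch of the combined loss for a single example (in practice a deep learning framework computes these terms; `alpha` is a hyperparameter you tune, and the specific values here are illustrative):

```python
import math

def multitask_loss(class_logits, class_label, box_pred, box_true, alpha=1.0):
    """Total Loss = Cross-Entropy (classification head)
                  + alpha * MSE (bounding-box regression head)."""
    # Numerically stable softmax over the classification logits,
    # then negative log-likelihood of the true class.
    m = max(class_logits)
    exps = [math.exp(z - m) for z in class_logits]
    cls_loss = -math.log(exps[class_label] / sum(exps))
    # Mean squared error over the 4 box values (x, y, w, h).
    reg_loss = sum((p - t) ** 2 for p, t in zip(box_pred, box_true)) / 4
    return cls_loss + alpha * reg_loss
```

With uniform logits over two classes and a perfect box, the loss reduces to the classification term ln(2) ≈ 0.693; increasing `alpha` makes box errors dominate.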
The Major Limitation
This architecture works perfectly if—and only if—there is exactly one object in the image.
The Problem:
The architecture has a fixed output size (one class label, four coordinates).
If an image contains two or more objects, this model fails because it cannot output multiple sets of coordinates dynamically. It doesn't know how many boxes to predict.
The Solution:
To detect multiple objects, we need a method that can scan the image and handle a variable number of objects. This leads us to the Sliding Window Technique and eventually R-CNN.
The Evolution of Detection: From Sliding Windows to R-CNN
The Sliding Window Technique

To solve the problem of detecting multiple objects, researchers initially used the Sliding Window Technique.
How it works:
Window Selection: We select a small window (box) of a fixed size.
Sliding: We "slide" this window across the entire image, step by step.
Classification: At each step, we pass the content inside the window to our CNN model.
The model predicts if the window contains a specific class (e.g., Car, Motorcycle, Bike).
The Background Class (C+1): Since most windows will just contain empty road or sky, we add an extra class called "Background." If the window doesn't contain an object of interest, the model classifies it as Background.
The Major Flaws:
Computational Explosion: To catch every object, we have to slide the window thousands of times.
Scale & Aspect Ratio: Objects are not all the same size or shape. To catch a small car and a large truck, we would need to repeat the entire sliding process with multiple window sizes and aspect ratios.
Inefficiency: This results in millions of crops being passed to the CNN. Training and inference become insanely slow and computationally expensive (GPU intensive). It is simply not feasible for real-world applications.
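To make the explosion concrete, here is a minimal sketch of a fixed-size window generator (pure Python; the window size and stride below are illustrative assumptions):

```python
def sliding_windows(img_w, img_h, win_w, win_h, stride):
    """Yield every (x1, y1, x2, y2) crop of a fixed-size window
    slid across an img_w x img_h image with the given stride."""
    for y in range(0, img_h - win_h + 1, stride):
        for x in range(0, img_w - win_w + 1, stride):
            yield (x, y, x + win_w, y + win_h)
```

Even a toy 100 x 100 image with a 20 x 20 window and stride 10 yields 81 crops; multiply that by several scales, several aspect ratios, and realistic image sizes, and the crop count quickly reaches the hundreds of thousands.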
The Motivation for R-CNN
To speed up the model, we have two options:
Use a smaller, faster network (which hurts accuracy).
Decrease the number of inputs.
R-CNN (Region-based CNN) focuses on option 2. The core idea is: Why pass millions of empty background boxes to the model?
Instead of blindly sliding a window everywhere, we use a smart algorithm to identify "Region Proposals"—areas of the image that are likely to contain an object.
Reducing Inputs: Region Proposals

We use an external algorithm called Selective Search to segment the image.
It looks at the image and groups pixels based on color, texture, and intensity.
It identifies "blobs" or regions that look like distinct objects.
The Result: Instead of millions of random windows, Selective Search gives us approximately 2,000 high-quality region proposals.
This drastic decrease (from millions to ~2,000) makes using a deep, powerful CNN feasible.
The R-CNN System Overview
The R-CNN object detection system consists of three distinct modules:
Region Proposal Generation:
This is category-independent. We use Selective Search to generate ~2,000 candidate regions that might contain an object. These proposals define the set of candidate detections available to our detector.
Feature Extraction (CNN):
We use a large Convolutional Neural Network (like AlexNet or VGG).
Each of the 2,000 proposed regions is resized and passed through this CNN to extract a fixed-length feature vector.
Classification (SVMs):
We use a set of class-specific linear SVMs.
These SVMs take the feature vector from the CNN and classify it (e.g., "Is this a Car?" or "Is this a Person?").
In Summary:
We moved from a brute-force approach (Sliding Window) to a selective approach (Region Proposals). By filtering the input before it reaches the heavy neural network, R-CNN makes object detection possible with high accuracy.
The Core Engine: Selective Search

We need a way to reduce millions of sliding windows down to a manageable number of "smart" guesses. We use an algorithm called Selective Search.
How Selective Search Works

Selective Search is a bottom-up algorithm that groups pixels based on hierarchical grouping.

Initial Segmentation: The algorithm first generates many small over-segmented regions (superpixels) based on pixel intensity. The image looks like a mosaic of tiny shards.
Recursive Merging: It iteratively merges adjacent regions together based on similarity.
Similarity Metrics: It looks at Color (similar histograms), Texture (gradient orientations), Size (merging small regions first), and Shape Compatibility (how well two regions fit together).

Result: As the algorithm runs, small segments combine into larger blobs. By the end, we get locations that are highly likely to contain distinct objects (cars, people, etc.).

Instead of millions of random boxes, Selective Search outputs approximately 2,000 Region Proposals per image.
Pre-processing: Solving the Size Problem
The Challenge

We now have ~2,000 region proposals, but they are all different shapes and sizes (aspect ratios). However, standard CNNs (like AlexNet or VGG) require a fixed input size (e.g., 227 x 227 pixels).
The Solution: Warping vs. Padding

We must reshape every proposal before feeding it into the CNN.
Warping (Standard R-CNN): We forcibly resize the image patch to 227 x 227 regardless of its original shape.
Pro: Simple and fast.
Con: It distorts the object (e.g., a tall person might look wide and short).
Padding (Dilation): To preserve the aspect ratio, we can add a border (padding) around the object to make it square before resizing. This prevents distortion.
In the original R-CNN, we take these warped/resized images and pass them to the CNN for feature extraction.
Labeling Data: Intersection over Union (IoU)
Before we can train the model, we need to label our 2,000 proposals. We don't know which proposal contains a car or background yet. We use a metric called Intersection over Union (IoU) to figure this out.
What is IoU?
IoU measures the overlap between two bounding boxes: the Predicted Box (Proposal) and the Ground Truth Box (Manual Label).
IoU = Area of Intersection / Area of Union
Range: The value is between 0 and 1.
IoU ≈ 1: The boxes are perfectly aligned.
IoU ≈ 0: The boxes do not touch at all.
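The IoU formula translates directly into code. A minimal sketch for corner-format boxes (not tied to any particular library):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    # Corners of the intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```

Identical boxes give exactly 1.0, disjoint boxes give 0.0, and a half-overlapping pair lands in between (two 10 x 10 boxes shifted by 5 pixels give 1/3).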
Creating the Training Data

We have the Ground Truth (GT) boxes (manually labeled by humans) and the 2,000 Selective Search proposals. We need to assign a label to each proposal to train our CNN.
The Labeling Logic (for CNN Fine-Tuning)

We compare every proposal with the Ground Truth boxes:
Positive Samples (Object): If a proposal has an IoU ≥ 0.5 with any Ground Truth box, we label it as that object class (e.g., "Person").
Negative Samples (Background): If a proposal has an IoU < 0.5 with all Ground Truth boxes, we label it as "Background."
Edge Case: What if a proposal overlaps with two different Ground Truth objects?

Solution: We assign the label of the Ground Truth object with the highest IoU.
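The labeling rule, including the multiple-overlap edge case, can be sketched as follows (the function name `label_proposal` is my own; boxes are in corner format):

```python
def label_proposal(proposal, gt_boxes, gt_classes, thresh=0.5):
    """Label one proposal: the class of the Ground Truth box with the
    highest IoU if that IoU clears the threshold, else 'background'."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union
    overlaps = [iou(proposal, gt) for gt in gt_boxes]
    best = max(range(len(overlaps)), key=lambda i: overlaps[i])
    # Ties against multiple objects resolve to the highest-IoU Ground Truth.
    return gt_classes[best] if overlaps[best] >= thresh else "background"
```

Running this over all ~2,000 proposals produces the labeled pool used for CNN fine-tuning.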
Handling Class Imbalance
After labeling, we might have 1,900 background boxes and only 100 object boxes. If we train on this directly, the model will just learn to predict "Background" every time.
Solution: We use Random Sampling during training.
We construct a mini-batch (e.g., size 128) by randomly selecting:
32 Positives (Foreground objects)
96 Negatives (Background)
This ratio (usually 1:3) ensures the model sees enough object examples to learn effectively.
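A minimal sampler along these lines might look like this (the function name and the 25% foreground fraction mirror the 32:96 split described above; both are configurable assumptions):

```python
import random

def sample_minibatch(labels, batch_size=128, pos_fraction=0.25):
    """Draw a class-balanced mini-batch of indices: a fixed fraction of
    foreground proposals, the rest background."""
    pos = [i for i, lab in enumerate(labels) if lab != "background"]
    neg = [i for i, lab in enumerate(labels) if lab == "background"]
    # Cap the positive count by what is actually available
    n_pos = min(len(pos), int(batch_size * pos_fraction))
    n_neg = batch_size - n_pos
    return random.sample(pos, n_pos) + random.sample(neg, n_neg)
```

With 100 object proposals and 1,900 background proposals, each batch of 128 contains 32 foreground and 96 background samples, regardless of the raw imbalance.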
Summary of this Phase
Input: Image with multiple objects.
Selective Search: Generates ~2,000 proposals.
Warping: Resizes all proposals to 227 x 227.
Labeling: Uses IoU to tag proposals as "Object" or "Background."
Training: Fine-tunes the CNN using these labeled batches.
Next, we extract features from this fine-tuned CNN and train the final SVM classifiers.
Classification with SVMs
Once we have fine-tuned the CNN, we use it as a "Feature Extractor." Interestingly, the original R-CNN paper did not use the CNN's final Softmax layer for classification. Instead, it used Support Vector Machines (SVMs).

Why SVMs?
The CNN provides a dense feature vector (a compressed representation of the image content). To get the final label, R-CNN trains a separate Linear SVM for every single class.
If we have 10 object classes, we train 10 separate binary SVMs.
One-vs-All Strategy: Each SVM asks a simple question: "Is this region a Car, or is it anything else?"
The Training Data & Thresholds

The data labeling strategy for training the SVMs is slightly different from the CNN fine-tuning phase, and the SVMs are trained with hard negative mining, which helps reduce false positives.
Positive Examples: Only the Ground Truth boxes are treated as positives for their respective class.
Negative Examples: Region proposals with IoU < 0.3 (overlap with Ground Truth) are treated as background.
Ignored Regions: Proposals with IoU between 0.3 and 1 (that are not Ground Truth) are generally ignored during SVM training. This prevents the model from getting confused by "partially correct" boxes.
The Workflow:
Pass the warped proposal through the CNN.
Extract the feature vector (from the Flatten/FC layer).
Pass this vector to all class-specific SVMs.
The SVM with the highest confidence score determines the class.
Bounding Box Regression
The Problem: "Good Enough" isn't Perfect
Selective Search gives us ~2,000 proposed boxes. While some might significantly overlap with the object, they are rarely pixel-perfect.
Example: The box might catch the person's body but cut off their feet, or it might be slightly shifted to the left.
Since we cannot change the Selective Search output, we have to "tweak" or "correct" the box using a model. This is where the Bounding Box Regressor comes in.
The Solution: Linear Transformation

We train a linear regression model to predict correction factors (offsets) that transform the Proposed Box (P) into the Ground Truth Box (G).
We need to learn four parameters (tₓ, t_y, t_w, tₕ). These parameters tell us how to shift the center and scale the size of our proposal.
The Transformation Equations:
If our Proposed Box has center (Pₓ, P_y) and size (P_w, Pₕ), and we want to reach the Ground Truth (Gₓ, G_y, G_w, Gₕ):
Center Adjustment (x, y):
Gₓ = P_w ⋅ tₓ + Pₓ
G_y = Pₕ ⋅ t_y + P_y
(We shift the center by a fraction of the width/height)
Size Adjustment (w, h):
G_w = P_w ⋅ exp(t_w)
Gₕ = Pₕ ⋅ exp(tₕ)
(We scale the width/height using an exponential to ensure positive values)
Training the Regressor

Input: The feature vector of the proposal from the CNN.
Target: The offsets (tₓ, t_y, t_w, tₕ) computed by inverting the equations above: tₓ = (Gₓ − Pₓ)/P_w, t_y = (G_y − P_y)/Pₕ, t_w = log(G_w/P_w), tₕ = log(Gₕ/Pₕ).
Constraint: We only train this regressor on proposals that are already "close" to the object (typically IoU > 0.6).

Why? Linear regression works well for small adjustments. If a box is far away from the object, the transformation required is too complex for a simple linear model, so we don't even try to fix it.
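The forward and inverse transforms can be sketched directly from the equations in this section (the function names `compute_targets` and `apply_deltas` are my own; boxes are in center format):

```python
import math

def compute_targets(p, g):
    """Regression targets (tx, ty, tw, th) that map proposal P onto
    Ground Truth G. Both boxes are (cx, cy, w, h)."""
    tx = (g[0] - p[0]) / p[2]          # center shift as fraction of width
    ty = (g[1] - p[1]) / p[3]          # center shift as fraction of height
    tw = math.log(g[2] / p[2])         # log scale factor for width
    th = math.log(g[3] / p[3])         # log scale factor for height
    return (tx, ty, tw, th)

def apply_deltas(p, t):
    """Inverse transform: apply predicted deltas to a proposal box."""
    gx = p[2] * t[0] + p[0]
    gy = p[3] * t[1] + p[1]
    gw = p[2] * math.exp(t[2])         # exp keeps width/height positive
    gh = p[3] * math.exp(t[3])
    return (gx, gy, gw, gh)
```

By construction, `apply_deltas(p, compute_targets(p, g))` recovers `g` exactly, which is the round-trip property the regressor is trained to approximate.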
The Bottleneck:
Because we have to run the CNN 2,000 times per image (once for each proposal), this process is extremely slow. It takes approximately 40–50 seconds to process just one image. This high latency makes R-CNN unsuitable for real-time applications, paving the way for Fast R-CNN and Faster R-CNN.
Post-Processing: Non-Max Suppression (NMS)
The Problem: Multiple Detections

When we run our model on an image, it often gets "too excited" and detects the same object multiple times. For example, for a single person riding a horse, the model might output 5 or 6 slightly different bounding boxes, all claiming to be "Person."
We don't want 6 boxes for one person; we want the single best box. The process of eliminating these redundant, overlapping boxes is called Non-Max Suppression (NMS).
How NMS Works (Step-by-Step)
NMS is applied independently for each class. If we have "Person" and "Horse" classes, we run NMS on the "Person" boxes first, and then separately on the "Horse" boxes.
The Algorithm:
Filter & Sort:
Discard any box with a confidence score below a low threshold (e.g., 0.1).
Sort the remaining boxes in descending order of their confidence scores (Objectness Score).
Example: Box A (0.9), Box B (0.85), Box C (0.34)...
Select the Best Box:
Pick the box with the highest score (e.g., Box A with 0.9) as our "valid detection."
Suppress Overlaps:
Compare this selected box (Box A) with all other remaining boxes (Box B, C, etc.).
Calculate the IoU (Intersection over Union) between Box A and the others.
The Logic:
If IoU > 0.5: The boxes overlap significantly. We assume Box B is detecting the same object as Box A, but with lower confidence. Discard Box B.
If IoU < 0.5: The boxes barely touch or are far apart. We assume Box C might be a different person entirely. Keep Box C.
Loop:
Move to the next highest scoring box that hasn't been discarded (e.g., Box C) and repeat the process.
Result:
We are left with only the most confident boxes for each distinct object in the image.
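The loop above maps to a short greedy implementation. A minimal sketch for one class (corner-format boxes; the 0.5 IoU threshold is the conventional default):

```python
def nms(boxes, scores, iou_thresh=0.5):
    """Greedy Non-Max Suppression: repeatedly keep the highest-scoring
    box and discard remaining boxes that overlap it too much.
    Returns the indices of the kept boxes, best first."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Suppress every remaining box that overlaps the kept one too much
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep
```

With two heavily overlapping boxes (scores 0.9 and 0.8) plus one far-away box (score 0.95), the function keeps the far box and the 0.9 box, and suppresses the 0.8 duplicate.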
Example Scenario
Imagine an image with two different people.
Person 1: Detected by Box A (0.9) and Box B (0.8). IoU(A, B) = 0.8.
Person 2: Detected by Box C (0.95). IoU(A, C) = 0.05.
Execution:
We start with the highest overall score: Box C (0.95).
We compare Box C with A and B. The IoU is very low (< 0.5), so A and B are kept (they are likely a different person).
We move to the next highest: Box A (0.9).
We compare Box A with Box B. The IoU is high (0.8). Since A is more confident, Box B is suppressed (deleted).
Final Output: Box C (Person 2) and Box A (Person 1).
The Complete R-CNN Training Pipeline
To build the R-CNN model from scratch, we follow this specific order:
Dataset Preparation:
Extract Region Proposals from images using Selective Search.
Warp/Resize all proposals to a fixed size (e.g., 227 x 227).
Pre-Training (Feature Extractor):
Train a standard CNN (like VGG16 or AlexNet) on a large classification dataset (e.g., ImageNet).
Fine-Tuning (The "R-CNN" Step):
Fine-tune the CNN on your specific object detection dataset using the warped proposals.
Labeling: IoU >= 0.5 is Object; IoU < 0.5 is Background.
Structure: (N+1) classes (Objects + Background).
SVM Training (The Classifier):
Freeze the CNN and use it only to extract features.
Train binary SVMs for each class.
Labeling: Ground Truth is Positive; IoU < 0.3 is Negative.
Bounding Box Regression:
Train a Linear Regressor to fix tight localization errors.
Data: Only train on proposals with IoU > 0.6.
Inference (Prediction):
Pass image → Selective Search → CNN → SVMs → Regressor → NMS → Final Output.
Evaluating Object Detection Models: mAP
Why Not Accuracy?
For standard classification, we use metrics like Accuracy, F1-Score, or ROC-AUC.
Problem: In object detection, "Accuracy" is meaningless because the vast majority of the image is "Background." A model that predicts "Nothing" everywhere would have 99% accuracy but 0% utility.
Instead, the gold standard metric for object detection is mAP (Mean Average Precision).
Prerequisites: Precision & Recall
To understand mAP, we first need to define Precision and Recall in the context of bounding boxes.
Precision (Quality): Out of all the boxes the model predicted, how many were actually correct?
Precision = TP / (TP + FP)
Recall (Quantity): Out of all the real objects (Ground Truth) in the images, how many did the model find?
Recall = TP / (TP + FN)

True Positive (TP): A predicted box with IoU >= Threshold (usually 0.5) to a Ground Truth box.
False Positive (FP): A predicted box with IoU < Threshold, or a duplicate detection.
False Negative (FN): A Ground Truth object that the model missed completely.
Step-by-Step: Calculating mAP

mAP is calculated per class (e.g., first calculate for "Dog," then "Cat," then average them). Let's walk through the calculation for the "Dog" class.
1. Sort Predictions

We assume we have 3 real dogs (Ground Truths) in our dataset. Our model makes 4 predictions.
First, we sort all predictions by their Confidence Score (highest to lowest).
| Rank | Prediction ID | Confidence | IoU with GT | Result (Threshold 0.5) |
| --- | --- | --- | --- | --- |
| 1 | Box A | 0.95 | 0.85 | TP |
| 2 | Box B | 0.90 | 0.20 | FP |
| 3 | Box C | 0.80 | 0.75 | TP |
| 4 | Box D | 0.65 | 0.10 | FP |
2. Calculate Precision & Recall at Each Step
We calculate the cumulative Precision and Recall as we move down the list.
Row 1 (Box A): It's a Match (TP).
Total TPs: 1. Total Predictions: 1.
Precision: 1/1 = 1.0
Recall: 1/3 = 0.33 (we found 1 out of 3 real dogs).
Row 2 (Box B): It's a Miss (FP).
Total TPs: 1. Total Predictions: 2.
Precision: 1/2 = 0.5 (accuracy dropped because of the bad guess).
Recall: 1/3 = 0.33 (still haven't found a new dog).
Row 3 (Box C): It's a Match (TP).
Total TPs: 2. Total Predictions: 3.
Precision: 2/3 ≈ 0.67
Recall: 2/3 ≈ 0.67 (found 2 out of 3 real dogs).
Row 4 (Box D): It's a Miss (FP).
Total TPs: 2. Total Predictions: 4.
Precision: 2/4 = 0.5
Recall: 2/3 ≈ 0.67
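The cumulative calculation can be reproduced with a few lines of code. A minimal sketch (pure Python; it computes the raw area under the PR curve, without the precision-envelope interpolation that PASCAL VOC and COCO evaluators apply, and the function name is my own):

```python
def average_precision(results, n_gt):
    """Given confidence-sorted TP/FP flags and the number of Ground Truth
    objects, return the (recall, precision) points and the AP (raw area
    under the PR curve, no interpolation)."""
    tp = 0
    points = []
    for k, is_tp in enumerate(results, start=1):
        tp += is_tp
        points.append((tp / n_gt, tp / k))   # (cumulative recall, precision)
    # Sum precision over each increment in recall
    ap, prev_recall = 0.0, 0.0
    for recall, precision in points:
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return points, ap
```

Feeding in the worked example (TP, FP, TP, FP with 3 real dogs) reproduces the precision values 1.0, 0.5, 0.67, 0.5 and the recall values 0.33, 0.33, 0.67, 0.67 from the rows above.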
Average Precision (AP) & The PR Curve

If we plot these Precision (y-axis) and Recall (x-axis) values, we get the Precision-Recall (PR) Curve.
Average Precision (AP): This is the Area Under the Curve (AUC). It represents the model's performance for that specific class.
Example: For the "Dog" class, let's say the area under the curve is 0.58.
For the "Cat" class, we repeat the process and get an AP of 0.62.
The Final Metric: Mean Average Precision (mAP)

mAP is simply the average of the AP scores across all classes.

Example: (0.58 + 0.62) / 2 = 0.60.
Since we used an IoU threshold of 0.5, we write this as: mAP@0.5 = 0.60.
mAP Across Multiple IoU Thresholds
Using a single threshold of 0.5 is often considered too lenient. A box that barely touches the object counts as "correct."
Modern benchmarks (like COCO) calculate mAP at multiple thresholds and average them to reward high-precision localization.
Calculate mAP @ IoU 0.50
Calculate mAP @ IoU 0.55
Calculate mAP @ IoU 0.60
...
Calculate mAP @ IoU 0.95
We average all these values to get the primary metric:
mAP@[0.5:0.95]
Example (illustrative values):
mAP@0.50 = 0.60
mAP@0.75 = 0.40
mAP@0.95 = 0.10
Final Combined mAP = the mean over all ten thresholds. Since the AP falls as the threshold tightens, this combined score is always lower than mAP@0.50 alone (averaging just the three values shown gives ≈ 0.37).
This final number tells us how good the model is both at detecting objects and at drawing tight boxes around them.


