
How CNNs Work: A Comprehensive Guide to the Convolution Operation

  • Writer: Aryan
  • Jan 12
  • 6 min read

Introduction


A Convolutional Neural Network (CNN) is a specialized type of neural network whose architecture differs from that of a traditional Artificial Neural Network (ANN). In addition to the fully connected layers found in ANNs, CNNs include convolution layers and pooling layers, which make them particularly effective for image-related tasks. This architecture is inspired by the human visual cortex, allowing CNNs to learn spatial patterns efficiently.

As shown in the image, a CNN typically begins with convolution layers and ends with fully connected layers. When an image is fed into a CNN, the initial convolution layers perform edge detection by identifying basic patterns such as lines and corners. These early layers focus on primitive features. As we move deeper into the network, subsequent convolution layers combine these simple features to extract more complex patterns, ultimately leading to high-level feature representations used for classification.

For example, in face recognition, early layers detect edges, which then combine to form parts such as eyes, nose, and ears. Deeper layers integrate these parts to represent the complete face, after which the fully connected layers perform classification. This hierarchical feature learning is the core idea behind CNNs, with the convolution layer being the most critical component of the architecture.


Basics of Images


To understand CNNs, it is essential to first understand what images are and how they are stored in computer memory. In practice, we mainly work with two types of images: grayscale images and RGB (colored) images.

Let us start with grayscale images. A grayscale image is composed of pixels arranged in a grid-like structure. Each pixel represents an intensity value rather than a color. For example, a low-resolution image of size 28 × 28 consists of 28 rows and 28 columns of pixels. Each pixel stores a numerical value between 0 and 255, where 0 represents black, 255 represents white, and values in between represent different shades of gray. Conceptually, a grayscale image is simply a collection of pixels, and mathematically it can be stored as a two-dimensional array. Computers typically store such images as 2D NumPy arrays, where each cell corresponds to the intensity value of a pixel.
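The idea above can be sketched in a few lines of NumPy. The pixel values below are made up purely for illustration:

```python
import numpy as np

# A tiny 4x4 grayscale "image": each entry is one pixel's intensity
# (0 = black, 255 = white, values in between = shades of gray)
gray = np.array([
    [  0,   0,   0,   0],
    [  0,  64, 128,   0],
    [  0, 128, 255,   0],
    [  0,   0,   0,   0],
], dtype=np.uint8)

print(gray.shape)   # (4, 4) -> 4 rows and 4 columns of pixels
print(gray[2, 2])   # 255 -> the white pixel in row 2, column 2
```

The `dtype=np.uint8` matches the 0-255 range: an unsigned 8-bit integer per pixel, which is how grayscale images are commonly stored.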

In the case of colored images, the representation is slightly different. A color image is composed of three channels: red, green, and blue (RGB), which are the primary color components. By combining different intensities of these three channels, we can generate a wide range of colors.

Each channel in an RGB image can be thought of as an independent grayscale image, where pixel values again lie between 0 and 255. These channels are stacked together, usually in the order red, green, and blue. As a result, a colored image is represented as a three-dimensional array with shape height × width × 3 (for example, 228 × 228 × 3). In contrast, a grayscale image has only a single channel and is therefore represented as a two-dimensional array.
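The channel-stacking described above can be demonstrated directly; the 2×2 size here is just for readability:

```python
import numpy as np

# Three single-channel arrays (one per primary colour), each 2x2
red   = np.full((2, 2), 255, dtype=np.uint8)  # maximum red everywhere
green = np.zeros((2, 2), dtype=np.uint8)
blue  = np.zeros((2, 2), dtype=np.uint8)

# Stacking along the last axis gives the height x width x 3 layout
rgb = np.stack([red, green, blue], axis=-1)

print(rgb.shape)   # (2, 2, 3)
print(rgb[0, 0])   # [255 0 0] -> a pure red pixel
```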

This numerical and structured representation of images is what makes them suitable inputs for convolutional neural networks.


Edge Detection (Convolution Operation)


The main purpose of convolution is to extract features from an image. Let us understand this using a simple example of edge detection.

We consider an image of size 6×6 and apply an edge detection operation on it. In convolution, we always deal with three components: the input image, a filter (also called a kernel), and the output feature map.

In this example image, the pixels in the upper half have the value 0 (black), while those in the lower half have the value 255 (white).

Visually, this image looks like a black region on the top and a white region at the bottom. After applying edge detection, the output should ideally highlight the boundary between these two regions.

To perform convolution, we use a filter (kernel), which is essentially a small matrix. Most commonly, filters are of size 3×3, although larger filters can also be used. The filter used here is a horizontal edge detection filter, meaning it is designed to detect horizontal changes in pixel intensity. When this filter is convolved over the image, the result is a feature map.

For the convolution operation, we place the filter on the top-left corner of the image and perform element-wise multiplication between the filter values and the corresponding image pixels. After multiplying, we sum all the values to get a single number.

For example, the calculation at the first position is:

0×(−1) + 0×(−1) + 0×(−1) + 0×0 + 0×0 + 0×0 + 0×1 + 0×1 + 0×1 = 0

This value is placed in the first row and first column of the feature map.

Next, we slide the filter one step to the right and repeat the same element-wise multiplication and summation. The computed value is inserted into the next position of the feature map. We continue moving the filter horizontally, performing the same calculations. In this case, the result remains zero for these positions.

After completing the horizontal movement, we move the filter one row down and again slide it from left to right, calculating values at each position and inserting them into the feature map.

By repeating this process over the entire image, we obtain the final feature map.
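The sliding-window procedure above can be written as a short NumPy loop. The exact kernel values are an assumption (a standard horizontal edge detector with −1s on top and 1s on the bottom), but the mechanics match the description:

```python
import numpy as np

# 6x6 image: top half black (0), bottom half white (255)
image = np.zeros((6, 6), dtype=np.int32)
image[3:, :] = 255

# Horizontal edge-detection filter (assumed values: -1s over 1s)
kernel = np.array([[-1, -1, -1],
                   [ 0,  0,  0],
                   [ 1,  1,  1]], dtype=np.int32)

n, m = image.shape[0], kernel.shape[0]
out = np.zeros((n - m + 1, n - m + 1), dtype=np.int32)

# Slide the filter over every valid position (stride 1, no padding)
for i in range(n - m + 1):
    for j in range(n - m + 1):
        patch = image[i:i + m, j:j + m]
        out[i, j] = np.sum(patch * kernel)  # element-wise multiply, then sum

print(out.shape)  # (4, 4)
print(out)
```

Rows of the feature map where the window sits entirely inside the black or the white region come out as 0; rows where the window straddles the black-to-white boundary produce a strong positive response, which is exactly the highlighted horizontal edge.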

The resulting feature map highlights the horizontal edge present in the image. When visualized, the output image shows a strong response at the boundary between the black and white regions, which is exactly what we expect from a horizontal edge detector.

This process is called edge detection. In this example, we detected a horizontal edge because we used a horizontal edge detection filter. If we use a different filter, such as a vertical edge detector or diagonal edge detectors, we can detect other types of edges. There are many possible filters, each sensitive to a different pattern or direction.

An important advantage of deep learning is that we do not need to design these filters by hand. In a Convolutional Neural Network, filter values are initialized randomly and then adjusted automatically by backpropagation during training. These filter values are the learnable weights of the network, playing the same role as the weights in an artificial neural network (ANN): depending on the training data, the CNN learns the most useful filter values on its own.

In the example discussed, we used a 6×6 image and a 3×3 filter, which produced a 4×4 feature map. Similarly, if we apply a 3×3 filter to a 28×28 image (with no padding and stride 1), the resulting feature map will be of size 26×26.

In general, if the image size is n×n and the filter size is m×m, then the feature map size is:

(n − m + 1) × (n − m + 1)
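This size formula is easy to check against both examples from the text:

```python
def feature_map_size(n, m):
    # Valid convolution: stride 1, no padding
    return n - m + 1

print(feature_map_size(6, 3))    # 4  -> 6x6 image, 3x3 filter gives 4x4
print(feature_map_size(28, 3))   # 26 -> 28x28 image, 3x3 filter gives 26x26
```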


In the visualization, red and blue colors represent different activation responses produced by the filter. Red indicates positive activation, meaning the filter has strongly detected a top edge at that location. The stronger the red color, the more confidently the filter is identifying a top edge. Blue represents negative activation, which corresponds to the opposite pattern, in this case a bottom edge. Since our goal is to detect top edges, positive (red) activations are the ones of interest.

Once the feature map is generated, an activation function is applied. Commonly, ReLU (Rectified Linear Unit) is used. ReLU sets all negative values to zero while keeping positive values unchanged. As a result, the blue regions (negative activations) are removed, and only the red regions (positive activations corresponding to top edges) remain. The final feature map therefore highlights only the desired top-edge information.
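In NumPy, ReLU is a single call; the feature-map values below are illustrative:

```python
import numpy as np

# A toy feature map with both positive and negative activations
feature_map = np.array([[-765,    0],
                        [ 765, -120]])

# ReLU: negative responses (the "blue" regions) become 0,
# positive responses (the "red" regions) pass through unchanged
activated = np.maximum(feature_map, 0)

print(activated)  # [[  0   0]
                  #  [765   0]]
```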


Working with RGB Images

In RGB images, we work with three channels: red, green, and blue. Accordingly, the filter used for convolution must also span all three channels. This means the filter size becomes 3 × 3 × 3, where each slice of the filter corresponds to one color channel. You can think of the image as a cuboid with depth 3 and the filter as a smaller cuboid with the same depth. The convolution operation then proceeds by sliding this filter across the image, just as in the grayscale case, while performing calculations across all channels.

When convolution is performed between an RGB image and a 3 × 3 × 3 filter, the output at each spatial location is a single value. This happens because the filter contains 27 values, which are multiplied element-wise with the corresponding 27 pixel values from the image patch and then summed to produce one number. That number is placed in the feature map. As a result, convolving a three-channel image with a three-channel filter produces a single-channel feature map.

In general, if the input image has size n × n × c and the filter has size m × m × c, the resulting feature map will have spatial dimensions (n − m + 1) × (n − m + 1) with a single channel.
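The key point above — 27 products collapsing into one number per position — can be verified with random data (the pixel and filter values here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
image  = rng.integers(0, 256, size=(6, 6, 3))   # n x n x c RGB image
kernel = rng.integers(-1, 2, size=(3, 3, 3))    # m x m x c filter, same depth

n, m = image.shape[0], kernel.shape[0]
out = np.zeros((n - m + 1, n - m + 1), dtype=np.int64)

for i in range(n - m + 1):
    for j in range(n - m + 1):
        # The 3x3x3 patch has 27 values; all 27 products are summed
        # into a single number for this spatial location
        out[i, j] = np.sum(image[i:i + m, j:j + m, :] * kernel)

print(out.shape)  # (4, 4) -> a single-channel feature map
```

Note that the depth axis disappears: a three-channel input convolved with one three-channel filter yields a two-dimensional (single-channel) output.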


Multiple Filters

In practice, it is rare to apply only a single filter. Instead, multiple filters are used to capture different types of features. For example, one filter may focus on vertical edge detection while another detects horizontal edges. Each filter produces its own feature map.

Continuing with the same example, if we apply two different 3 × 3 × 3 filters to a 6 × 6 × 3 image, we obtain two separate feature maps, each of size 4 × 4. These feature maps are then stacked together, forming a combined feature volume of size 4 × 4 × 2. This volume becomes the input to the next convolutional layer.

More generally, if we use k filters, the output will have k channels. For instance, using 10 filters would produce a feature volume of size 4 × 4 × 10. Thus, the number of filters applied in a convolutional layer directly determines the depth (number of channels) of the output feature map.
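Putting the whole section together, a minimal sketch with k = 10 randomly initialized filters (random values stand in for learned weights) shows the depth of the output tracking the number of filters:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid convolution of an n x n x c image with an m x m x c filter."""
    n, m = image.shape[0], kernel.shape[0]
    out = np.zeros((n - m + 1, n - m + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + m, j:j + m] * kernel)
    return out

rng = np.random.default_rng(1)
image = rng.standard_normal((6, 6, 3))                          # 6x6x3 input
filters = [rng.standard_normal((3, 3, 3)) for _ in range(10)]   # k = 10 filters

# Each filter yields one 4x4 feature map; stacking them gives
# an output volume whose depth equals the number of filters
volume = np.stack([conv2d(image, f) for f in filters], axis=-1)
print(volume.shape)  # (4, 4, 10)
```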








