Padding and Strides in CNNs Explained: Theory, Formulas, and Practical Intuition
- Aryan

- Jan 14
- 4 min read
WHY DO WE NEED PADDING IN CNNs?
After applying a convolution operation, we obtain a feature map. At this stage, two important issues arise.
First, the spatial dimensions of the feature map become smaller than the original image. If we apply multiple convolution layers without any correction, the feature map keeps shrinking at each layer. As the network goes deeper, this continuous reduction in size leads to loss of spatial information, especially fine details that may be important for learning.
Second, border pixels participate in fewer convolution operations compared to pixels near the center of the image. Middle pixels are covered by the filter many times, whereas border pixels are covered only a few times. As a result, the feature map is dominated by information from the central region, and the contribution of border pixels is reduced. If important features lie near the edges of the image, this information may be partially or completely lost.
Padding addresses both these problems by preserving spatial dimensions and ensuring that border pixels are treated more evenly during convolution.
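The uneven treatment of border pixels is easy to verify directly. The sketch below (plain NumPy; the function name is mine, not from any library) counts how many filter windows cover each pixel of a 5×5 image under a 3×3 filter with stride 1:

```python
import numpy as np

def coverage_counts(n, m):
    """Count how many m×m filter positions cover each pixel of an n×n image."""
    counts = np.zeros((n, n), dtype=int)
    for i in range(n - m + 1):        # top-left row of each filter position
        for j in range(n - m + 1):    # top-left column
            counts[i:i + m, j:j + m] += 1
    return counts

print(coverage_counts(5, 3))
# A corner pixel is covered by only 1 window, while the center pixel is covered by 9.
```

This confirms the imbalance: the center contributes to nine output values, each corner to just one.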
WHAT IS PADDING?

When we apply a convolution operation, the size of the feature map depends on the image and filter dimensions. If we have an n × n image and an m × m filter, the resulting feature map has dimensions
(n−m+1)×(n−m+1).
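In code, this relationship is a one-liner (the helper name is mine, chosen for illustration):

```python
def feature_map_size(n, m):
    """Output size of an unpadded convolution: n×n image, m×m filter."""
    return n - m + 1

print(feature_map_size(5, 3))  # a 5×5 image with a 3×3 filter gives a 3×3 map
print(feature_map_size(7, 3))  # a 7×7 image with a 3×3 filter gives a 5×5 map
```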
In many cases, we want the spatial size of the feature map to remain the same as the original image. Since changing the filter size is not desirable, we instead modify the image dimensions. For example, with a 5×5 image and a 3×3 filter, the output feature map becomes 3×3. To retain a 5×5 output, we need to increase the effective image size before convolution.
To achieve this, we conceptually solve n−m+1=5. This gives n=7, meaning we expand the original 5×5 image to a 7×7 image.

This expansion is done by adding rows and columns around the image boundaries—at the top, bottom, left, and right. In this case, we add one row on the top and bottom, and one column on the left and right, resulting in two additional rows and two additional columns overall. The values added in these boundary regions are zeros. These added boundaries are called padding, and because the added values are zero, this method is known as zero padding.

After converting the image from 5×5 to 7×7, we apply the 3×3 filter. The resulting feature map now has dimensions 5×5, matching the original image size.
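This can be checked end to end with a minimal sketch: NumPy's np.pad adds the zero border, and a naive loop-based convolution (strictly speaking a cross-correlation, as in most deep learning libraries; the function is my own, not a library call) shows the shapes before and after padding:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive stride-1 convolution with no padding."""
    n, m = image.shape[0], kernel.shape[0]
    out = np.empty((n - m + 1, n - m + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + m, j:j + m] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5×5 image
kernel = np.ones((3, 3)) / 9.0                     # 3×3 averaging filter

padded = np.pad(image, 1)                          # zero padding with p = 1 → 7×7
print(conv2d_valid(image, kernel).shape)           # (3, 3) without padding
print(conv2d_valid(padded, kernel).shape)          # (5, 5) with padding
```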
Mathematically, without padding, the feature map size is calculated as n−m+1. With padding p, the formula becomes
n+2p−m+1.
For a 5×5 image, a 3×3 filter, and padding p=1:
5+2(1)−3+1=5.
Thus, padding allows us to preserve spatial dimensions while performing convolution.
In Keras, there are two common padding options. valid means no padding is applied, so the feature map size reduces after convolution. same applies padding automatically so that the output feature map has the same spatial dimensions as the input.
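For stride 1 and an odd filter size m, same padding corresponds to p = (m−1)/2, since then n+2p−m+1 = n. A quick check (plain Python; the helper name is mine, not a Keras API):

```python
def same_padding(m):
    """Padding that preserves spatial size for an odd m×m filter at stride 1."""
    assert m % 2 == 1, "formula assumes an odd filter size"
    return (m - 1) // 2

for m in (3, 5, 7):
    n = 9
    p = same_padding(m)
    assert n + 2 * p - m + 1 == n   # output size equals input size
    print(m, "->", p)               # 3 -> 1, 5 -> 2, 7 -> 3
```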
STRIDES

In convolution, we slide a filter across the image to compute feature values. The number of pixels by which the filter moves at each step is called the stride. When the filter moves one pixel to the right and one pixel downward after each convolution, the stride is (1,1). This means the filter shifts one column horizontally and one row vertically at a time.
We can also use larger strides, such as (2,2). In this case, after performing one convolution, the filter jumps two pixels to the right. Similarly, when moving downward, it skips two rows at a time. As the stride increases, the filter covers fewer positions on the image, which directly reduces the spatial size of the resulting feature map.
Without stride (i.e., stride = 1), the feature map size is given by n−f+1, where n is the image size and f is the filter size. When stride s is applied, this formula becomes
⌊(n−f)/s⌋+1.
If padding p is also used, the formula generalizes to
⌊(n+2p−f)/s⌋+1.
For example, if we have a 7×7 image, a 3×3 filter, and a stride of 2, the feature map size becomes
⌊(7−3)/2⌋+1=3, resulting in a 3×3 feature map.
If we apply padding with p=1 along with stride 2, the size becomes
⌊(7+2(1)−3)/2⌋+1=4, giving a 4×4 feature map.
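Both examples follow from the general formula, where integer (floor) division handles the rounding. A small helper (the name and signature are mine, chosen for illustration):

```python
def conv_output_size(n, f, s=1, p=0):
    """General output size: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

print(conv_output_size(7, 3, s=2))        # 3: a 7×7 image, 3×3 filter, stride 2
print(conv_output_size(7, 3, s=2, p=1))   # 4: same setup with padding p = 1
```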
Thus, using stride generally makes the feature map smaller, and increasing the stride further reduces its size. When the stride is greater than 1, this operation is commonly referred to as strided convolution.
Special Case

Consider a 6×7 image with a 3×3 filter and stride 2. The filter moves two pixels at a time horizontally and vertically. In this case, when the filter reaches the lower part of the image, there may not be enough pixels left to perform the final convolution step. As a result, the feature map becomes incomplete along one dimension.
Using the formula, for the rows we get
⌊(6−3)/2⌋+1=⌊1.5⌋+1=2.
Since we cannot have fractional feature map sizes, the floor operation discards the incomplete final position, leaving 2.
For the columns,
⌊(7−3)/2⌋+1=3.
Hence, the final feature map size is 2×3. In such cases, flooring is applied to handle uneven coverage caused by larger strides.
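Because the formula is applied per dimension, a non-square image is handled by computing rows and columns separately; integer division performs the flooring automatically (the helper name below is mine):

```python
def conv_dim(n, f, s):
    """Output length along one dimension: floor((n - f) / s) + 1."""
    return (n - f) // s + 1

rows = conv_dim(6, 3, 2)   # floor(1.5) + 1 = 2
cols = conv_dim(7, 3, 2)   # 2 + 1 = 3
print((rows, cols))        # (2, 3)
```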
WHY DO WE NEED STRIDED CONVOLUTION?
There are two main reasons for using strided convolution.
First, strided convolution helps when we are interested in higher-level features rather than fine-grained, low-level details. Larger strides reduce spatial resolution, allowing the network to focus on coarse, abstract patterns.
Second, it improves computational efficiency. Increasing the stride reduces the size of the feature map, which lowers the number of computations and speeds up training, especially for large datasets. Although modern hardware has reduced this concern, strided convolution is still used in practice for efficient downsampling in many architectures.


