Pooling in CNNs Explained: Translation Variance, Memory Efficiency, and Types of Pooling Layers

  • Writer: Aryan
  • Jan 16
  • 6 min read

The Problem with Convolution

There are two main issues associated with repeated convolution operations.

The first issue is memory consumption, and the second is translation variance.

Let us understand this with an example.

Suppose we have an input image of size 228 × 228 × 3 and we apply 100 filters of size 3 × 3 × 3. After convolution (with stride 1 and no padding), the resulting feature map size becomes 226 × 226 × 100. This means we must store a very large number of values in memory.

If we use 32-bit floating-point values, storing a single such feature map requires roughly 19 MB of RAM. This is only for one training sample. If we process a batch of 100 images, the memory requirement increases to around 2 GB for a single batch. This level of memory usage can easily slow down the system or even cause the program to crash.
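The arithmetic above is easy to verify with a few lines of Python (using MB = 2²⁰ bytes and GB = 2³⁰ bytes):

```python
# Memory estimate for a 226 x 226 x 100 feature map stored as 32-bit floats,
# for a single sample and for a batch of 100 samples.
bytes_per_float = 4
single_map = 226 * 226 * 100 * bytes_per_float   # bytes for one sample
batch = single_map * 100                         # bytes for a batch of 100

print(f"one sample:   {single_map / 2**20:.1f} MB")   # ~19.5 MB
print(f"batch of 100: {batch / 2**30:.2f} GB")        # ~1.90 GB
```

One feature map already costs about 19.5 MB, and a batch of 100 pushes that to nearly 2 GB before counting gradients or other layers.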

Therefore, it becomes necessary to reduce the spatial size of feature maps as the network goes deeper. One possible way to do this is by increasing the stride value during convolution. However, in practice, pooling layers are preferred over large strides.

The reason is that pooling not only reduces the feature map size and memory usage, but it also helps address the translation variance problem. Pooling makes the network more robust to small shifts or translations in the input image, which is a desirable property in computer vision tasks.

This is why pooling is commonly used after convolution layers in CNN architectures.

 

Translation Variance

A convolution operation exhibits translation variance. Convolution layers are designed to detect features in an image, such as edges, corners, or textures. However, when a feature is detected, it is recorded at a specific spatial location in the feature map. This makes the detected features location-dependent.

For example, if a cat appears on the right side of an image, the corresponding activation appears on the right side of the feature map. If the same cat appears on the left side, the activation shifts to the left. The feature itself is the same, but its position changes.

This becomes a problem in image classification tasks. For classification, the exact location of a feature is usually not important; what matters is whether the feature exists, not where it appears. However, because convolution preserves spatial locations, the next layers may treat the same feature differently simply because it appears at a different position.

This issue—where features are tied to their spatial location—is known as translation variance.

Pooling helps solve this problem. By downsampling the feature map, pooling reduces spatial sensitivity and makes the network less dependent on the exact position of features. As a result, features become more translation-invariant, meaning small shifts in the input image do not significantly affect the network’s output.

This is the key reason pooling layers are used after convolution layers in CNNs.

 

Pooling

Pooling is a downsampling operation applied after the convolution layer. Its primary role is to reduce the spatial size of the feature map, which helps control memory usage and computational cost. In CNN architectures, a pooling layer is typically placed after convolution and the activation function.

 

Example and Flow

First, we apply a convolution operation using an image and a filter to obtain a feature map. Next, we introduce non-linearity by applying an activation function, commonly ReLU. Once we obtain this non-linear feature map, we apply the pooling operation.

There are different types of pooling techniques. The most commonly used is max pooling, but other variants also exist, such as average pooling, min pooling, L2 pooling, and global pooling. As a working example, consider a small 4 × 4 feature map.

Max Pooling Operation

To perform pooling, we first define the window size and the stride. A common choice is a (2 × 2) window with a stride of 2, although these values can be adjusted based on requirements.

With a (2 × 2) window, we look at four values at a time and select the maximum value. In the first window, the maximum value is 5. We then move the window to the right by a stride of 2 and select the next maximum, which is 3. Next, we move the window downward by two steps, where the maximum value is 7, and finally move right again to get 4.

After completing this process, the resulting feature map becomes (2 × 2) in size. This demonstrates how pooling effectively reduces the size of the feature map while retaining important information.
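The original feature-map figure is not reproduced here, so the sketch below uses a hypothetical 4 × 4 matrix whose window maxima match the walkthrough above (5, 3, 7, and 4), together with a minimal NumPy implementation of 2 × 2 max pooling with stride 2:

```python
import numpy as np

# Hypothetical 4 x 4 feature map chosen so its 2 x 2 window maxima
# are 5, 3, 7, 4, matching the walkthrough in the text.
fmap = np.array([[1, 5, 2, 3],
                 [4, 0, 1, 2],
                 [7, 3, 4, 1],
                 [2, 6, 0, 4]])

def max_pool_2d(x, size=2, stride=2):
    """Max pooling over windows of a 2-D feature map."""
    h, w = x.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out

print(max_pool_2d(fmap))
# [[5 3]
#  [7 4]]
```

Each 2 × 2 window collapses to a single value, so the 4 × 4 input becomes a 2 × 2 output.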

Why Pooling Works

Pooling also helps address translation variance. Within each small receptive field, all values represent similar local features, and max pooling selects only the dominant feature. Lower-level details are suppressed, while more meaningful, higher-level features are preserved.

Because of this downsampling, small shifts or translations in the input image do not significantly affect the pooled output. This introduces translation invariance, which makes the model more robust and improves its generalization ability.
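A small NumPy sketch of this invariance: a single strong activation is shifted by one pixel, but because both positions fall inside the same pooling window, the pooled outputs are identical (the helper below is a minimal re-implementation for illustration, not library code):

```python
import numpy as np

def max_pool_2d(x, size=2, stride=2):
    """Max pooling over non-overlapping windows of a 2-D feature map."""
    out = np.empty((x.shape[0] // stride, x.shape[1] // stride), dtype=x.dtype)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = x[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out

a = np.zeros((4, 4)); a[0, 0] = 9.0   # strong activation at (0, 0)
b = np.zeros((4, 4)); b[1, 1] = 9.0   # same activation shifted to (1, 1)

# Both positions land in the same 2 x 2 window, so the one-pixel
# shift is absorbed and the pooled outputs match exactly.
assert (max_pool_2d(a) == max_pool_2d(b)).all()
```

Shifts larger than the pooling window do change the output, which is why pooling gives robustness only to small translations.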

 

Pooling on Volumes

Let us understand how pooling works on RGB images when multiple filters are applied.

Assume we have an input image of size 6 × 6 × 3 and we apply two filters of size 3 × 3 × 3. After convolution (with stride 1 and no padding), we obtain two feature maps of size 4 × 4 (one per filter). We first apply the ReLU activation to each feature map and then perform pooling.

When multiple feature maps are present, pooling is applied independently to each feature map. For a pooling window of 2 × 2 with stride 2, each 4 × 4 feature map is reduced to 2 × 2. Since there are two feature maps, the output after pooling becomes 2 × 2 × 2.

The same logic applies at scale. If we apply 100 filters, the convolution output will be 4 × 4 × 100, and after pooling, it becomes 2 × 2 × 100. Pooling never mixes information across channels; it operates channel-wise on individual feature maps.
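A quick NumPy sketch of this channel-wise behaviour (the random feature maps stand in for real convolution outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
fmaps = rng.standard_normal((4, 4, 2))   # two 4 x 4 feature maps, layout (H, W, C)

# Pool each channel independently: split H and W into 2 x 2 blocks
# and take the max over each block. Channels are never mixed.
pooled = fmaps.reshape(2, 2, 2, 2, 2).max(axis=(1, 3))

print(pooled.shape)   # (2, 2, 2): each 4 x 4 map reduced to 2 x 2
```

The reshape trick splits each spatial axis into (block index, within-block offset), so the max over the offset axes implements 2 × 2 max pooling with stride 2 on every channel at once.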

 

Advantages of Pooling

 

  1. Reduction in feature map size

    Consider an RGB image of size 228 × 228 × 3. Applying 100 filters of size 3 × 3 × 3 produces a feature map of 226 × 226 × 100. If we then apply pooling with a 2 × 2 window and stride 2, the spatial dimensions are reduced to 113 × 113 × 100, significantly lowering memory and computation requirements.

  2. Translation invariance

    Translation invariance means that the exact location of a feature is not important; what matters is the presence of the feature. Pooling performs downsampling, which encourages the model to focus on high-level features while suppressing less important low-level details. After max pooling, similar features produce similar outputs even if their positions in the image change slightly.

  3. Enhanced dominant features (Max Pooling)

    This effect is specific to max pooling. Within a small receptive field, max pooling selects the strongest activation. As a result, dominant features such as edges or textures become more prominent, while weaker responses are discarded. The output feature map therefore highlights the most informative features.

  4. No training required

    Pooling is a fixed aggregation operation and does not involve any learnable parameters. There is no training step for pooling layers. We only need to specify the local receptive field size, stride, and type of pooling, making pooling simple and computationally efficient.

 

Types of Pooling

 

  1. Max Pooling

    In max pooling, we select the maximum value from each local pooling window. This allows the network to retain the most dominant activation within a small receptive field and discard weaker responses.

  2. Average Pooling

    In average pooling, we compute the average value of all elements within the pooling window. Each region is represented by its mean value, which results in smoother feature maps but may suppress strong activations.

  3. Global Pooling

    Global pooling operates on the entire feature map instead of local regions. It has two common variants:

    • Global Max Pooling, where a single maximum value is extracted from the whole feature map.

    • Global Average Pooling, where a single average value is computed from the entire feature map.

      If we have multiple feature maps, global pooling produces one scalar per feature map. For example, with three filters, the output becomes 1 × 3.

      Global average pooling is often used at the end of CNN architectures as a replacement for the flatten operation before the fully connected layer. This significantly reduces the number of parameters and helps reduce overfitting.
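Both global variants can be sketched in NumPy; with three feature maps, each variant collapses the whole volume to one scalar per map (the random values are placeholders for real activations):

```python
import numpy as np

rng = np.random.default_rng(1)
fmaps = rng.standard_normal((4, 4, 3))   # three 4 x 4 feature maps (H, W, C)

gap = fmaps.mean(axis=(0, 1))   # global average pooling: one mean per map
gmp = fmaps.max(axis=(0, 1))    # global max pooling: one max per map

print(gap.shape)   # (3,) -- i.e. the 1 x 3 output described above
```

Each feature map contributes exactly one number, so the output length always equals the number of filters, regardless of the spatial size.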

 

Disadvantages of Pooling

 

  1. Loss of spatial information

    Pooling introduces translation invariance, which is beneficial for tasks like image classification where feature location does not matter. However, in tasks such as image segmentation, where precise spatial location is important, pooling can be harmful. For such tasks, pooling is often avoided or carefully controlled.

  2. Information loss

    Since pooling downsamples the feature map, a significant amount of information is discarded. Fine-grained details may be lost, which can negatively affect tasks that require high spatial resolution.




bottom of page