CNN Architecture Explained: LeNet-5 Architecture with Layer-by-Layer Breakdown
- Aryan

- Jan 18
- 3 min read
CNN Architecture (Conceptual Overview)

We have learned the core building blocks of CNNs: convolution, padding, stride, and pooling. Using these concepts, we can design a complete CNN architecture. A basic CNN follows a structured pipeline from raw image input to final prediction.
Consider an input RGB image of size 32 × 32 × 3. This image is first passed to a convolution layer, which contains learnable filters (also called kernels). Suppose we use three filters; since the input has three channels, each filter must also span all three channels (e.g., a 5 × 5 × 3 kernel). The first convolution layer produces a feature map: a 3D volume whose depth equals the number of filters used.
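The spatial size of that feature map follows the standard convolution formula, floor((n + 2p − f) / s) + 1. A minimal sketch (the function name `conv_output_size` is ours, not from any library):

```python
def conv_output_size(n, f, padding=0, stride=1):
    """Spatial output size of a convolution: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * padding - f) // stride + 1

# 32 x 32 input, 5 x 5 filters, no padding, stride 1:
side = conv_output_size(32, 5)  # 28
# With three filters, the output volume is 28 x 28 x 3
# (depth = number of filters, regardless of input depth).
```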
After convolution, we apply a non-linear activation function, typically ReLU, to introduce non-linearity. The activated feature map is then passed through a pooling layer, which reduces the spatial dimensions while retaining important information. The output is again a tensor.
This process—convolution → activation → pooling—can be repeated multiple times. Each repetition helps the network learn increasingly abstract and high-level features. Once we finish stacking convolution and pooling layers, we obtain a final 3D tensor.
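One convolution → activation → pooling stage can be sketched in plain NumPy. This is a toy single-channel version for illustration only (real CNN libraries vectorize this and handle multi-channel filters); the helper names are ours:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv2d_single(image, kernel):
    """Valid cross-correlation of a 2D image with a 2D kernel (no padding, stride 1)."""
    h, w = image.shape
    f = kernel.shape[0]
    out = np.zeros((h - f + 1, w - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + f, j:j + f] * kernel)
    return out

def max_pool(fmap, f=2, stride=2):
    """2 x 2 max pooling with stride 2 halves each spatial dimension."""
    h, w = fmap.shape
    out = np.zeros((h // stride, w // stride))
    for i in range(0, h - f + 1, stride):
        for j in range(0, w - f + 1, stride):
            out[i // stride, j // stride] = fmap[i:i + f, j:j + f].max()
    return out

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32))
kernel = rng.standard_normal((5, 5))
fmap = max_pool(relu(conv2d_single(img, kernel)))
print(fmap.shape)  # (14, 14)
```

One such stage takes a 32 × 32 input to a 14 × 14 map; stacking stages keeps shrinking the spatial size while (in a real network) the depth grows with the number of filters.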
This tensor is then flattened, meaning it is converted from a 3D structure into a 1D vector. This vector is fed into one or more fully connected layers, similar to those used in traditional artificial neural networks (ANNs). These layers learn how to combine the extracted features to make predictions.
Finally, we add an output layer, where the choice of activation and loss function depends on the task. For example, we may use a sigmoid activation with binary cross-entropy for binary classification, or softmax with categorical cross-entropy for multi-class classification. This overall flow represents the basic CNN architecture.
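For the multi-class case, softmax turns the final layer's raw scores (logits) into a probability distribution, and categorical cross-entropy penalizes low probability on the true class. A minimal sketch with made-up logits:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def categorical_cross_entropy(probs, true_class):
    """Negative log-probability assigned to the correct class."""
    return -np.log(probs[true_class])

logits = np.array([2.0, 1.0, 0.1])   # example scores for 3 classes
p = softmax(logits)                  # sums to 1; largest logit -> largest probability
loss = categorical_cross_entropy(p, true_class=0)
```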
By modifying design choices—such as the number of convolution layers, number of filters, kernel size, stride, padding, number of fully connected layers, number of neurons, or whether to use dropout—we can create different CNN architectures suited for different problems.
In large-scale benchmarks like ImageNet, researchers propose and evaluate various architectures by experimenting with these components. The earliest CNN architecture was LeNet, introduced by Yann LeCun. Later architectures such as AlexNet, VGGNet, GoogLeNet (the first of the Inception family), and ResNet introduced architectural innovations to improve depth, performance, and training stability.
If we want to design our own CNN, we essentially start from this basic structure and introduce variations in depth, width, and connectivity to match the problem requirements.
LeNet

Yann LeCun is often referred to as the father of CNNs. He began working on convolutional neural networks in the late 1980s and published the LeNet research paper in 1998. LeNet was developed as part of a project for the US Postal Service, where CNNs were used to automatically read handwritten ZIP codes.
The architecture is commonly known as LeNet-5 because it consists of five learnable layers.
LeNet-5 Architecture
LeNet takes an input image of size 32 × 32 (grayscale). The flow of the network is as follows:
The input image is first passed through a convolution layer with 6 filters of size 5 × 5, producing a feature map of size 28 × 28 × 6. This is followed by an average pooling layer with a 2 × 2 receptive field and stride 2, which reduces the size to 14 × 14 × 6.
Next, a second convolution layer is applied using 16 filters of size 5 × 5, resulting in a feature map of 10 × 10 × 16. This is again followed by average pooling (2 × 2, stride 2), reducing the output to 5 × 5 × 16.
The resulting tensor is then flattened, giving a 1D vector of size 400. This vector is passed to a fully connected layer with 120 neurons, followed by another fully connected layer with 84 neurons. Finally, an output layer with 10 neurons (using softmax) is used to classify digits from 0 to 9.
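The shape progression above can be verified with simple arithmetic. A small trace (helper names are ours), assuming valid convolutions with stride 1 and 2 × 2 average pooling with stride 2, as described:

```python
def conv(shape, filters, f, stride=1):
    """Valid convolution: spatial size shrinks by f - 1; depth = number of filters."""
    h, w, _ = shape
    return ((h - f) // stride + 1, (w - f) // stride + 1, filters)

def avg_pool(shape, f=2, stride=2):
    """2 x 2 pooling with stride 2 halves height and width; depth unchanged."""
    h, w, c = shape
    return ((h - f) // stride + 1, (w - f) // stride + 1, c)

s = (32, 32, 1)        # grayscale input
s = conv(s, 6, 5)      # (28, 28, 6)
s = avg_pool(s)        # (14, 14, 6)
s = conv(s, 16, 5)     # (10, 10, 16)
s = avg_pool(s)        # (5, 5, 16)
flat = s[0] * s[1] * s[2]  # 400-dimensional vector after flattening
```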
Activation and Design Choices
LeNet uses the tanh activation function instead of ReLU. At the time (1998), tanh was considered one of the best-performing activation functions, as ReLU had not yet become popular.
In terms of parameter flow:
- First FC layer weights: 400 × 120 = 48,000
- Second FC layer weights: 120 × 84 = 10,080
- Output layer weights: 84 × 10 = 840
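These counts can be checked in a few lines. Assuming one bias per neuron (the standard convention; note this covers only the fully connected layers, not the convolution layers):

```python
# (inputs, outputs) for each fully connected layer of LeNet-5
fc_dims = [(400, 120), (120, 84), (84, 10)]

weights = [i * o for i, o in fc_dims]  # [48000, 10080, 840]
biases = [o for _, o in fc_dims]       # one bias per output neuron
total = sum(weights) + sum(biases)
print(total)  # 59134
```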
Although pooling layers are present, LeNet-5 is called a five-layer network because only layers with learnable parameters are counted: two convolution layers and three fully connected layers.
LeNet-5 was among the first CNNs to be successfully deployed on a real-world problem, and it laid the foundation for deeper architectures developed later. As neural networks evolved, architectures became deeper and more complex, but the core ideas introduced in LeNet remain fundamental to modern CNN design.


