Introduction to Convolutional Neural Networks (CNNs)

AI Syllabus Team
Deep Learning Instructors
To a computer, an image is just an array of numbers. Before CNNs, teaching a machine to "see" a cat required millions of parameters and fragile, hand-coded rules. CNNs changed everything by learning spatial hierarchies automatically.
Why Not Dense Networks?
A standard feed-forward (Dense) network connects every input to every neuron in the next layer. If you have a modest 200x200 pixel color image, that's `200 * 200 * 3 = 120,000` inputs. If the first hidden layer has 1,000 neurons, you immediately need 120 million weights.
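The arithmetic above can be checked in a couple of lines. This is just the weight count for the first layer (biases would add another 1,000 parameters):

```python
# Weight count for a dense layer on a 200x200 RGB image
inputs = 200 * 200 * 3   # 120,000 input values
hidden = 1_000           # neurons in the first hidden layer
weights = inputs * hidden
print(weights)           # 120,000,000
```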
Worse, flattening an image into a 1D array destroys the spatial hierarchy. A pixel's meaning is highly dependent on its neighbors (forming a line, an edge, an eye). Dense layers ignore this locality.
Convolution: The Filter
A Convolutional Layer solves this by using small grids of weights called filters (or kernels), typically 3x3 or 5x5 in size.
- Parameter Sharing: The same filter slides across the entire image. If it learns to detect a vertical edge in the top left, it can detect that same edge in the bottom right using the exact same weights.
- Local Receptive Fields: Each neuron only looks at a small region of the input, preserving local spatial structure.
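To make the sliding-filter idea concrete, here is a minimal NumPy sketch of a single-channel convolution (stride 1, no padding). The function name `conv2d` is our own; note that deep learning libraries actually compute cross-correlation, as done here, rather than the textbook flipped-kernel convolution:

```python
import numpy as np

def conv2d(image, kernel):
    """Naive 'valid' convolution: slide the kernel over the image
    with stride 1 and take an elementwise product-sum at each position."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A hand-made vertical-edge filter: responds where brightness
# changes from left to right.
edge_filter = np.array([[1, 0, -1],
                        [1, 0, -1],
                        [1, 0, -1]])

# Image with a dark left half and bright right half.
img = np.zeros((5, 5))
img[:, 3:] = 1.0
print(conv2d(img, edge_filter))  # strong response at the boundary columns
```

The same nine weights are reused at every position, which is exactly the parameter sharing described above; a trained CNN learns filters like this one automatically.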
Pooling: Downsampling
After applying convolutional filters (and an activation function like ReLU), we get "feature maps". A Pooling layer (like MaxPooling2D) slides a window (usually 2x2) over the feature map and keeps only the maximum value in that window.
This drastically reduces the width and height of the data, saving computational power and making the network robust against small translations (if an eye shifts one pixel to the left, MaxPooling will likely output the exact same value).
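A 2x2 max-pooling pass is simple enough to sketch directly in NumPy (the helper name `max_pool_2x2` is ours, not a library API):

```python
import numpy as np

def max_pool_2x2(fmap):
    """2x2 max pooling, stride 2: keep the largest value in each
    non-overlapping 2x2 window, halving height and width."""
    H, W = fmap.shape
    # Trim odd edges, then group pixels into 2x2 blocks and take each max.
    return (fmap[:H - H % 2, :W - W % 2]
            .reshape(H // 2, 2, W // 2, 2)
            .max(axis=(1, 3)))

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 1],
                 [0, 1, 5, 6],
                 [2, 2, 7, 8]])
print(max_pool_2x2(fmap))  # [[4 2]
                           #  [2 8]]
```

Each output value reports only that a strong activation occurred *somewhere* in its 2x2 window, which is why small shifts of the input often leave the pooled output unchanged.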
❓ Frequently Asked Questions
What is the definition of a Convolutional Neural Network (CNN)?
A Convolutional Neural Network (CNN) is a type of deep neural network specifically designed to process data with a grid-like topology, such as images. It uses mathematical operations called convolutions, applying learnable filters across the input to automatically extract spatial features like edges, textures, and shapes.
What is the difference between Stride and Padding in CNNs?
Stride: This refers to the number of pixels a filter moves when sliding across the input. A stride of 1 moves the filter one pixel at a time. A stride of 2 moves it two pixels at a time, roughly halving the output's width and height.
Padding: Convolution naturally shrinks the output (a 3x3 filter on a 5x5 image yields a 3x3 output). Padding adds a border of zero-valued pixels around the input; when the border is sized so that the output feature map keeps the same spatial dimensions as the input, this is known as 'same' padding.
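Both effects follow from one standard formula for the output side length, floor((n + 2p - k) / s) + 1, where n is the input size, k the filter size, s the stride, and p the padding. A quick sketch (the helper name is ours):

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Output side length of a convolution:
    floor((n + 2*padding - k) / stride) + 1."""
    return (n + 2 * padding - k) // stride + 1

print(conv_output_size(5, 3))                         # 3: 'valid', 5x5 -> 3x3
print(conv_output_size(5, 3, padding=1))              # 5: 'same' padding
print(conv_output_size(200, 3, stride=2, padding=1))  # 100: stride 2 halves it
```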
Why do we need a Flatten layer in a CNN?
Convolution and Pooling layers output 3-dimensional tensors (height, width, channels). However, traditional Dense (fully connected) layers, which are typically used at the end of the network to output classification probabilities, require a 1-dimensional array. The Flatten layer serves as a bridge, converting the 3D tensor into a 1D vector.
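The shape bookkeeping is easy to see with NumPy's `reshape` (the 7x7x64 shape here is just an illustrative example, e.g. what a stack of conv/pool layers might leave you with):

```python
import numpy as np

# Pretend output of the last pooling layer: 7x7 feature maps, 64 channels.
pooled = np.zeros((7, 7, 64))

# Flatten: collapse the 3D tensor into one long vector for the Dense head.
flat = pooled.reshape(-1)
print(flat.shape)  # (3136,) == 7 * 7 * 64
```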