Feature Extraction: Convolutions & Pooling
In traditional neural networks, flattening an image discards critical spatial information. Convolutional Neural Networks (CNNs) preserve this structure by applying localized filters, enabling the model to "see" edges, textures, and shapes.
What is a Convolution?
A convolution is a mathematical operation applied to images. Imagine a small grid of numbers, called a Kernel or Filter (often 3x3). This kernel slides across the original image pixel by pixel. At each step, we multiply the kernel values by the underlying pixel values and sum them up.
This process creates a new 2D matrix called a Feature Map. Different kernels can be trained to detect different features: one might find horizontal edges, while another detects colors or corners.
Visualizing the sliding window of a kernel over an input matrix.
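The multiply-and-sum operation described above can be sketched directly with nested loops. This is an illustrative NumPy implementation (stride 1, no padding), not an optimized library routine; the Sobel-style kernel and sample image are assumptions for the demo.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a kernel over an image (stride 1, no padding) and
    return the resulting feature map ('valid' convolution)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Multiply the kernel by the underlying patch, then sum.
            feature_map[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return feature_map

# A vertical-edge detector applied to a 5x5 image with a dark/bright boundary.
image = np.array([
    [0, 0, 0, 255, 255],
    [0, 0, 0, 255, 255],
    [0, 0, 0, 255, 255],
    [0, 0, 0, 255, 255],
    [0, 0, 0, 255, 255],
], dtype=float)
kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]], dtype=float)

fm = convolve2d(image, kernel)
print(fm.shape)  # (3, 3) -- a 3x3 kernel on a 5x5 image yields 3x3
```

The feature map lights up (large values) exactly where the vertical edge sits, and stays at zero in the flat regions.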
Dimensionality Reduction via Pooling
After multiple convolutions, the network generates a massive amount of data. Pooling layers address this by downsampling the feature maps, reducing the computational load and helping to curb overfitting.
- Max Pooling: The most common technique. It takes a small window (e.g., 2x2) and only keeps the maximum value. This retains the most prominent features (like a sharp edge) while discarding the rest.
- Average Pooling: Computes the average of the values in the window. It's less common today but used in specific architectures like Global Average Pooling (GAP) before the final output layer.
Max Pooling shrinking a 4x4 grid into a 2x2 grid by extracting maximum values.
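Max Pooling over non-overlapping 2x2 windows can be expressed compactly with a reshape trick. This is a minimal NumPy sketch for square inputs whose sides are divisible by the window size; the sample 4x4 feature map is invented for the demo.

```python
import numpy as np

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling: split the map into size x size
    blocks and keep only the maximum value of each block."""
    h, w = feature_map.shape
    # Group rows and columns into blocks, then take the max per block.
    blocks = feature_map.reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

fm = np.array([[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 0],
               [1, 8, 3, 4]], dtype=float)

print(max_pool2d(fm))
# [[6. 4.]
#  [8. 9.]]
```

Each 2x2 block collapses to its strongest activation, shrinking the 4x4 grid to 2x2 while keeping the most prominent responses.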
Deep Learning FAQ: Convolutions
What is Stride in a Convolutional Layer?
Stride refers to how many pixels the filter moves at a time. A stride of 1 means the filter moves 1 pixel per step. A stride of 2 moves the filter 2 pixels per step, skipping positions and roughly halving the spatial dimensions of the output feature map.
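The effect of stride on output size follows the standard convolution size formula; here is a small helper (an illustrative function name, not a library API) that makes the halving concrete:

```python
def conv_output_size(input_size, kernel_size, stride, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

# A 3x3 kernel on a 32-pixel-wide input:
print(conv_output_size(32, 3, stride=1))  # 30
print(conv_output_size(32, 3, stride=2))  # 15 -- roughly half
```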
What is Padding (Same vs. Valid)?
When a filter slides over an image, the edge pixels are covered by fewer windows than the center pixels, and the output shrinks. Padding (usually zero-padding) adds a border of zeros around the image. `padding='same'` keeps the output dimensions the same as the input (for stride 1). `padding='valid'` means no padding, so the output shrinks.
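Zero-padding itself is just adding a border of zeros. A minimal NumPy sketch: padding a 4x4 image by one pixel on each side is exactly what a 3x3 'same' convolution needs to keep a 4x4 output.

```python
import numpy as np

image = np.arange(16, dtype=float).reshape(4, 4)

# Add a 1-pixel border of zeros: for a 3x3 kernel at stride 1,
# this is the 'same' padding that preserves the 4x4 output size.
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)

print(padded.shape)  # (6, 6)
```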
Why do we use ReLU after a Convolution?
A convolution is a linear operation (just multiplications and additions). To learn complex, real-world patterns, we must introduce non-linearity. The ReLU (Rectified Linear Unit) activation function is standard: it sets negative values to zero while passing positive values through unchanged. Because its gradient is 1 for positive inputs, it mitigates the vanishing gradient problem that plagues saturating activations like sigmoid, and it speeds up training.
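ReLU is a one-liner: element-wise max(0, x) applied to the feature map. A minimal NumPy sketch (the sample values are invented for the demo):

```python
import numpy as np

def relu(x):
    """Element-wise max(0, x): negatives become 0, positives pass through."""
    return np.maximum(0, x)

feature_map = np.array([[-3.0, 1.5],
                        [ 0.0, -0.5]])

# Negative activations are zeroed; 1.5 is unchanged.
activated = relu(feature_map)
print(activated)
```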