Deep Learning for Vision:
VGG & ResNet Architectures
The ImageNet challenge spawned architectures that dictate modern Computer Vision. Understanding VGG's depth strategy and ResNet's skip connections is the key to building robust image classification, detection, and segmentation models.
1. The VGG Paradigm (Visual Geometry Group)
Introduced in 2014, VGG networks (like VGG16 and VGG19) simplified the structural design of Convolutional Neural Networks. Prior networks like AlexNet used large receptive fields (11x11, 7x7) in the first convolutional layers.
VGG established a new rule: Only use 3x3 convolutions. By stacking two 3x3 convolution layers, you achieve an effective receptive field of 5x5. Stacking three gives you 7x7. The advantage? You incorporate more non-linear activation functions (ReLU), making the decision function more discriminative, and you actually decrease the number of weights (parameters) in the model.
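The parameter savings are easy to verify with back-of-the-envelope arithmetic. The sketch below compares a single 7x7 convolution against a stack of three 3x3 convolutions over the same effective receptive field; the channel count `C = 256` is an illustrative assumption, and biases are ignored for simplicity.

```python
# Compare weight counts: one 7x7 conv vs. three stacked 3x3 convs
# mapping C channels to C channels (biases ignored).

def conv_params(kernel_size: int, channels: int) -> int:
    """Number of weights in one conv layer from C channels to C channels."""
    return kernel_size * kernel_size * channels * channels

C = 256  # illustrative channel count

single_7x7 = conv_params(7, C)       # 49 * C^2 weights
stacked_3x3 = 3 * conv_params(3, C)  # 27 * C^2 weights

print(single_7x7, stacked_3x3)  # 3211264 1769472
```

The stack uses roughly 27/49 of the weights while inserting three ReLU non-linearities instead of one.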
2. The Degradation Problem
Armed with VGG's logic, researchers tried building incredibly deep networks (50, 100, or 150 layers). Surprisingly, adding more layers eventually led to higher training error. This wasn't overfitting: it was the Vanishing Gradient Problem.
During backpropagation, gradients are multiplied at each layer. If gradients are less than 1, multiplying them repeatedly across 100 layers causes them to exponentially shrink to zero. The early layers of the network simply stop learning.
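The exponential shrinkage can be seen with a toy calculation. The per-layer gradient factor of 0.9 below is an illustrative assumption, not a measured value; the point is only what repeated multiplication does to it.

```python
# Toy illustration of the vanishing gradient: backpropagation multiplies
# a gradient factor at every layer. With factors below 1, the product
# shrinks exponentially with depth.

def upstream_gradient(factor: float, depth: int) -> float:
    """Gradient reaching the earliest layer after `depth` multiplications."""
    grad = 1.0
    for _ in range(depth):
        grad *= factor
    return grad

print(upstream_gradient(0.9, 10))   # ~0.349
print(upstream_gradient(0.9, 100))  # ~2.66e-05
```

At 100 layers the signal reaching the early layers is effectively zero, which is why those layers stop learning.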
3. ResNet (Residual Networks)
ResNet solved the degradation problem by introducing Skip Connections (or Shortcut Connections). Instead of hoping each few stacked layers directly fit a desired underlying mapping, ResNet explicitly lets these layers fit a residual mapping.
- The Math: Instead of learning the desired mapping H(x) directly, the network learns the residual F(x) = H(x) - x. The original input x is then added back at the end: H(x) = F(x) + x.
- The Benefit: If a layer is unnecessary, the network can simply drive its weights to zero. The skip connection ensures the input x passes through unchanged, maintaining performance instead of degrading it.
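A minimal numeric sketch of the idea, with a hypothetical one-parameter stand-in for the stacked layers F: when its weight is zero, F(x) = 0 and the block reduces to the identity mapping.

```python
# Residual block sketch: output = F(x) + x.
# `weight` parameterizes a toy linear-plus-ReLU stand-in for the
# stacked conv layers; it is a hypothetical placeholder, not ResNet's
# actual block structure.

def residual_block(x, weight=0.0):
    # F(x): toy transformation (linear scaling followed by ReLU)
    f = [max(weight * v, 0.0) for v in x]
    # Skip connection: add the unchanged input back
    return [fi + xi for fi, xi in zip(f, x)]

x = [1.0, -2.0, 3.0]
print(residual_block(x, weight=0.0))  # [1.0, -2.0, 3.0]  (identity)
print(residual_block(x, weight=0.5))  # [1.5, -2.0, 4.5]
```

With zero weights the input survives untouched, which is exactly the "do no harm" guarantee the bullet points describe.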
Frequently Asked Questions
What is the difference between VGG16 and ResNet50?
VGG16: A traditional, sequential architecture with 16 layers. It uses exclusively 3x3 convolutions and is highly uniform, but it has a massive number of parameters (~138 million) making it heavy to load and train.
ResNet50: A 50-layer deep network that utilizes skip connections. Despite being much deeper than VGG16, ResNet50 actually has fewer parameters (~25 million) and generally achieves higher accuracy because it trains better without vanishing gradients.
What is a Skip Connection?
A skip connection (or shortcut) takes the output of a previous layer and bypasses one or more intermediate layers, adding it directly to the output of a later layer. This creates an "express lane" for gradients during backpropagation, solving the vanishing gradient problem.
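The "express lane" can also be put in numbers. Through a residual block the local gradient of F(x) + x with respect to x is F'(x) + 1, so even a tiny F'(x) leaves a factor near 1 rather than near 0. The 0.1 local gradient below is an illustrative assumption.

```python
# Chained gradient factors with and without skip connections.
# Plain stack: each layer contributes its local gradient (here 0.1).
# Residual stack: each block contributes local gradient + 1.

def chained_gradient(local_grad: float, depth: int, residual: bool) -> float:
    factor = local_grad + 1.0 if residual else local_grad
    total = 1.0
    for _ in range(depth):
        total *= factor
    return total

plain = chained_gradient(0.1, 50, residual=False)  # 0.1**50, vanishes
skip = chained_gradient(0.1, 50, residual=True)    # 1.1**50, stays large
print(plain, skip)
```

The plain chain collapses to ~1e-50 while the residual chain keeps a usable gradient, which is why early layers in a ResNet continue to learn.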
Which architecture should I use for Transfer Learning?
For modern applications, ResNet (or newer variants like EfficientNet) is generally preferred over VGG. ResNet provides a better balance of accuracy and computational efficiency. However, VGG is often used for feature extraction in tasks like Neural Style Transfer because its sequential feature maps are very clean and interpretable.
