Image Segmentation: Mastering U-Net

Pascual Vila
AI Engineer // Code Syllabus
While Object Detection provides a bounding box around objects, Semantic Segmentation classifies every single pixel in an image. U-Net revolutionized this field by achieving precise localization using a unique symmetric architecture.
Downsampling: The Encoder
The left side of the "U" is a standard convolutional network. It consists of repeated blocks of two 3x3 convolutions (each followed by a ReLU) and a 2x2 max pooling operation. At each downsampling step, we double the number of feature channels.
Goal: Capture the "What" (context/semantics) while gradually reducing the "Where" (spatial resolution).
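The pooling half of that trade-off can be sketched in plain NumPy. This is a toy shape demo, not a trained layer: the channel doubling happens in the convolutions that precede each pool, while the pool itself only halves the spatial resolution.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 over a (C, H, W) feature map."""
    c, h, w = x.shape
    # Group pixels into 2x2 windows, then take the max of each window.
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

x = np.arange(16, dtype=float).reshape(1, 4, 4)
pooled = max_pool_2x2(x)
print(pooled.shape)  # (1, 2, 2): spatial resolution halved
```

Each pooled pixel keeps only the strongest activation of its 2x2 window, which is exactly why precise location information ("Where") is lost on the way down.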
Upsampling: The Decoder
The right side of the "U" is responsible for reconstructing the image. It uses transposed convolutions (or upsampling) to halve the number of feature channels while doubling the spatial dimensions.
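The spatial doubling can be illustrated with nearest-neighbour upsampling, one common (parameter-free) alternative to a learned transposed convolution. This is a minimal sketch; in a real U-Net the channel halving is done by the convolution that follows the upsample.

```python
import numpy as np

def upsample_2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    # Repeat every pixel twice along height and width.
    return x.repeat(2, axis=1).repeat(2, axis=2)

x = np.random.rand(64, 28, 28)
up = upsample_2x(x)
print(up.shape)  # (64, 56, 56): spatial dimensions doubled
```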
The Secret Weapon: Skip Connections
When the Encoder downsamples the image, exact location information is lost. Skip Connections solve this by taking the high-resolution feature maps from the Encoder and concatenating them with the upsampled output in the Decoder.
This provides the Decoder with the spatial precision it needs to draw accurate, pixel-perfect masks.
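The concatenation itself is a single operation along the channel axis. A minimal sketch with hypothetical shapes (64 channels at a 56x56 resolution):

```python
import numpy as np

# Encoder features saved before pooling, and decoder features
# after upsampling back to the same spatial resolution.
enc_features = np.random.rand(64, 56, 56)
dec_features = np.random.rand(64, 56, 56)

# Skip connection: stack both maps along the channel axis.
merged = np.concatenate([enc_features, dec_features], axis=0)
print(merged.shape)  # (128, 56, 56)
```

The next convolution in the decoder then sees both the semantic context (from below) and the fine spatial detail (from the skip) at once.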
FAQ
Object Detection vs Image Segmentation?
Object Detection (e.g., YOLO): Outputs a bounding box [x, y, width, height] and a class label. It does not know the exact shape of the object.
Segmentation (e.g., U-Net): Outputs a mask of the exact same size as the input image, where every pixel is assigned a class (e.g., 0 for background, 1 for car). It provides exact shapes.
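The difference is easy to see on a toy 6x6 mask (0 for background, 1 for car). Deriving a bounding box from the mask discards the shape:

```python
import numpy as np

mask = np.zeros((6, 6), dtype=int)
mask[2:5, 1:4] = 1   # segmentation output: the exact "car" pixels
mask[2, 1] = 0       # carve a corner so the shape is non-rectangular

ys, xs = np.nonzero(mask)
# Detection-style output: [x, y, width, height] around the same object.
bbox = [int(xs.min()), int(ys.min()),
        int(xs.max() - xs.min() + 1), int(ys.max() - ys.min() + 1)]
print(bbox)  # [1, 2, 3, 3] -- the carved corner is invisible to the box
```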
Why was U-Net originally created?
U-Net was introduced in 2015 for Biomedical Image Segmentation (e.g., tracking cells in microscopy). Medical data is scarce, and U-Net performs well with very few training images thanks to its heavy use of data augmentation (notably elastic deformations) and its skip connections.
What loss function is best for U-Net?
Standard Cross-Entropy often fails because of high class imbalance (an image might be 95% background and 5% object). Dice Loss or Intersection over Union (IoU) Loss are preferred as they evaluate the overlap between the predicted mask and the ground truth directly.
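A minimal soft Dice loss in NumPy as a sketch (framework versions, e.g. in PyTorch, follow the same formula; the small `eps` is a common trick to avoid division by zero on empty masks):

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss: 1 - 2*|P ∩ T| / (|P| + |T|).

    pred:   predicted probabilities in [0, 1], same shape as target
    target: binary ground-truth mask
    """
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

target = np.zeros((8, 8))
target[2:6, 2:6] = 1.0
print(round(dice_loss(target, target), 4))      # 0.0 for a perfect prediction
print(round(dice_loss(1 - target, target), 4))  # 1.0 for a fully wrong one
```

Because the loss is driven by the overlap term rather than a per-pixel average, a prediction that ignores a small foreground object is penalized heavily even if 95% of the pixels (the background) are classified correctly.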