Image Segmentation: Mastering U-Net

Pascual Vila
AI Engineer // Code Syllabus
While Object Detection provides a bounding box around objects, Semantic Segmentation classifies every single pixel in an image. U-Net revolutionized this field by achieving precise localization using a unique symmetric architecture.
Downsampling: The Encoder
The left side of the "U" is a standard convolutional network. It consists of repeated blocks of two 3x3 convolutions (each followed by a ReLU) and a 2x2 max pooling operation. At each downsampling step, we double the number of feature channels.
Goal: Capture the "What" (context/semantics) while gradually reducing the "Where" (spatial resolution).
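The pooling half of that trade-off can be sketched in plain NumPy. This is a toy shape demo, not a trained layer: the channel doubling happens in the convolutions that precede each pool, while the pool itself only halves the spatial resolution.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 over a (C, H, W) feature map."""
    c, h, w = x.shape
    # Group pixels into 2x2 windows, then take the max of each window.
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

x = np.arange(16, dtype=float).reshape(1, 4, 4)
pooled = max_pool_2x2(x)
print(pooled.shape)  # (1, 2, 2): spatial resolution halved
```

Each pooled pixel keeps only the strongest activation of its 2x2 window, which is exactly why precise location information ("Where") is lost on the way down.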
Upsampling: The Decoder
The right side of the "U" is responsible for reconstructing the image. It uses transposed convolutions (or upsampling) to halve the number of feature channels while doubling the spatial dimensions.
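The spatial doubling can be illustrated with nearest-neighbour upsampling, one common (parameter-free) alternative to a learned transposed convolution. This is a minimal sketch; in a real U-Net the channel halving is done by the convolution that follows the upsample.

```python
import numpy as np

def upsample_2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    # Repeat every pixel twice along height and width.
    return x.repeat(2, axis=1).repeat(2, axis=2)

x = np.random.rand(64, 28, 28)
up = upsample_2x(x)
print(up.shape)  # (64, 56, 56): spatial dimensions doubled
```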
The Secret Weapon: Skip Connections
When the Encoder downsamples the image, exact location information is lost. Skip Connections solve this by taking the high-resolution feature maps from the Encoder and concatenating them with the upsampled output in the Decoder.
This provides the Decoder with the spatial precision it needs to draw accurate, pixel-perfect masks.
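The concatenation itself is a single operation along the channel axis. A minimal sketch with hypothetical shapes (64 channels at a 56x56 resolution):

```python
import numpy as np

# Encoder features saved before pooling, and decoder features
# after upsampling back to the same spatial resolution.
enc_features = np.random.rand(64, 56, 56)
dec_features = np.random.rand(64, 56, 56)

# Skip connection: stack both maps along the channel axis.
merged = np.concatenate([enc_features, dec_features], axis=0)
print(merged.shape)  # (128, 56, 56)
```

The next convolution in the decoder then sees both the semantic context (from below) and the fine spatial detail (from the skip) at once.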
FAQ
Object Detection vs Image Segmentation?
Object Detection (e.g., YOLO): Outputs a bounding box [x, y, width, height] and a class label. It does not know the exact shape of the object.
Segmentation (e.g., U-Net): Outputs a mask of the exact same size as the input image, where every pixel is assigned a class (e.g., 0 for background, 1 for car). It provides exact shapes.
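The difference is easy to see on a toy 6x6 mask (0 for background, 1 for car). Deriving a bounding box from the mask discards the shape:

```python
import numpy as np

mask = np.zeros((6, 6), dtype=int)
mask[2:5, 1:4] = 1   # segmentation output: the exact "car" pixels
mask[2, 1] = 0       # carve a corner so the shape is non-rectangular

ys, xs = np.nonzero(mask)
# Detection-style output: [x, y, width, height] around the same object.
bbox = [int(xs.min()), int(ys.min()),
        int(xs.max() - xs.min() + 1), int(ys.max() - ys.min() + 1)]
print(bbox)  # [1, 2, 3, 3] -- the carved corner is invisible to the box
```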
Why was U-Net originally created?
U-Net was introduced in 2015 for Biomedical Image Segmentation (e.g., tracking cells in microscopy). Medical data is scarce, and U-Net performs well with very few training images thanks to its heavy use of data augmentation (notably elastic deformations) and its skip connections.
What loss function is best for U-Net?
Standard Cross-Entropy often fails because of high class imbalance (an image might be 95% background and 5% object). Dice Loss or Intersection over Union (IoU) Loss are preferred as they evaluate the overlap between the predicted mask and the ground truth directly.
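A minimal soft Dice loss in NumPy as a sketch (framework versions, e.g. in PyTorch, follow the same formula; the small `eps` is a common trick to avoid division by zero on empty masks):

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss: 1 - 2*|P ∩ T| / (|P| + |T|).

    pred:   predicted probabilities in [0, 1], same shape as target
    target: binary ground-truth mask
    """
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

target = np.zeros((8, 8))
target[2:6, 2:6] = 1.0
print(round(dice_loss(target, target), 4))      # 0.0 for a perfect prediction
print(round(dice_loss(1 - target, target), 4))  # 1.0 for a fully wrong one
```

Because the loss is driven by the overlap term rather than a per-pixel average, a prediction that ignores a small foreground object is penalized heavily even if 95% of the pixels (the background) are classified correctly.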