
Semantic Segmentation

Go beyond bounding boxes. Master the U-Net architecture to classify every pixel for medical imaging, autonomous driving, and more.




Image Segmentation: Mastering U-Net

Author

Pascual Vila

AI Engineer // Code Syllabus

While Object Detection provides a bounding box around objects, Semantic Segmentation classifies every single pixel in an image. U-Net revolutionized this field by achieving precise localization using a unique symmetric architecture.

Downsampling: The Encoder

The left side of the "U" is a standard convolutional network. It consists of repeated applications of 3x3 convolutions, each followed by a ReLU and a 2x2 max pooling operation. At each downsampling step, we double the number of feature channels.

Goal: Capture the "What" (context/semantics) while gradually reducing the "Where" (spatial resolution).
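One encoder stage can be sketched in a few lines of PyTorch. This is a minimal illustration, not the full paper architecture: the channel counts (64 to 128) match the original U-Net, but padding=1 is used here so the convolutions preserve spatial size, whereas the 2015 paper used unpadded convolutions.

```python
import torch
import torch.nn as nn

# One contracting ("encoder") stage: two 3x3 convolutions, each followed
# by ReLU, then a 2x2 max pool. The convs double the channels (64 -> 128);
# the pool halves the spatial dimensions.
encoder_stage = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(128, 128, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
)
pool = nn.MaxPool2d(kernel_size=2)

x = torch.randn(1, 64, 64, 64)   # (batch, channels, H, W)
features = encoder_stage(x)      # channels doubled: (1, 128, 64, 64)
down = pool(features)            # spatial halved:   (1, 128, 32, 32)
print(down.shape)
```

Note that `features` is kept around on purpose: each stage's pre-pooling output is exactly what the skip connections (below) will later feed to the Decoder.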

Upsampling: The Decoder

The right side of the "U" is responsible for reconstructing the image. It uses transposed convolutions (or upsampling) to halve the number of feature channels, while doubling the spatial dimensions.
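A single up-convolution step, sketched with the same illustrative channel counts as the encoder example (128 to 64); a 2x2 transposed convolution with stride 2 doubles height and width in one shot:

```python
import torch
import torch.nn as nn

# One expanding ("decoder") step: a 2x2 transposed convolution with
# stride 2 halves the channel count (128 -> 64) while doubling H and W.
up = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)

bottleneck = torch.randn(1, 128, 32, 32)
upsampled = up(bottleneck)
print(upsampled.shape)  # torch.Size([1, 64, 64, 64])
```

An equivalent alternative, common in modern reimplementations, is `nn.Upsample(scale_factor=2)` followed by a regular convolution to reduce the channels.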

The Secret Weapon: Skip Connections

When the Encoder downsamples the image, exact location information is lost. Skip Connections solve this by taking the high-resolution feature maps from the Encoder and concatenating them with the upsampled output in the Decoder.

This provides the Decoder with the spatial precision it needs to draw accurate, pixel-perfect masks.
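A skip connection is just a channel-wise concatenation. The sketch below assumes padded convolutions so the two feature maps already share the same spatial size; in the original unpadded U-Net, the encoder map must first be center-cropped to match.

```python
import torch

# Skip connection: merge the encoder's high-resolution feature map
# (saved before pooling) with the decoder's upsampled output by
# concatenating along the channel dimension (dim=1 in NCHW layout).
encoder_features = torch.randn(1, 64, 64, 64)  # "where": spatial detail
decoder_features = torch.randn(1, 64, 64, 64)  # "what": semantic context

merged = torch.cat([encoder_features, decoder_features], dim=1)
print(merged.shape)  # torch.Size([1, 128, 64, 64])
```

The 3x3 convolutions that follow the concatenation then fuse the semantic context from the Decoder with the spatial detail from the Encoder.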

Frequently Asked Questions

Object Detection vs Image Segmentation?

Object Detection (e.g., YOLO): Outputs a bounding box [x, y, width, height] and a class label. It does not know the exact shape of the object.

Segmentation (e.g., U-Net): Outputs a mask of the exact same size as the input image, where every pixel is assigned a class (e.g., 0 for background, 1 for car). It provides exact shapes.
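The difference is easy to see on a toy 4x4 image with a single "car" class (values chosen purely for illustration):

```python
import torch

# Detection output: one coarse box plus a label -- no shape information.
box = {"x": 1, "y": 1, "w": 2, "h": 2, "cls": "car"}

# Segmentation output: a label for every pixel of the input.
mask = torch.zeros(4, 4, dtype=torch.long)  # 0 = background
mask[1:3, 1:3] = 1                          # 1 = car
print(mask)
```

The mask has exactly the input's spatial dimensions, so irregular object outlines are captured pixel by pixel, which a rectangular box cannot do.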

Why was U-Net originally created?

U-Net was introduced in 2015 for Biomedical Image Segmentation (e.g., tracking cells in microscopy). Annotated medical data is scarce, and U-Net trains effectively from very few images thanks to aggressive data augmentation (notably elastic deformations) combined with its skip connections.

What loss function is best for U-Net?

Standard Cross-Entropy often fails because of high class imbalance (an image might be 95% background and 5% object). Dice Loss or Intersection over Union (IoU) Loss are preferred as they evaluate the overlap between the predicted mask and the ground truth directly.
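A minimal soft Dice loss for binary masks might look like the sketch below (the epsilon term and per-sample flattening are common implementation choices, not part of the page's text):

```python
import torch

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss for binary masks.

    pred:   predicted probabilities in [0, 1], shape (N, H, W)
    target: ground-truth mask of 0s and 1s, same shape
    """
    pred = pred.flatten(1)
    target = target.flatten(1)
    intersection = (pred * target).sum(dim=1)
    total = pred.sum(dim=1) + target.sum(dim=1)
    # Dice = 2|A n B| / (|A| + |B|); eps avoids division by zero
    dice = (2 * intersection + eps) / (total + eps)
    return 1 - dice.mean()

# A perfect prediction drives the loss toward 0; predicting all
# background against a non-empty mask drives it toward 1.
target = torch.zeros(1, 8, 8)
target[0, 2:6, 2:6] = 1.0
print(dice_loss(target.clone(), target).item())
```

Because the loss measures mask overlap directly, a model cannot score well by simply predicting the majority background class, which is exactly the failure mode of plain Cross-Entropy on imbalanced images.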

Segmentation Glossary

Semantic Segmentation
Assigning a class label to every pixel in an image, without differentiating between distinct objects of the same class.
Encoder
The contracting path of a network. Applies convolutions and max pooling to capture context and reduce spatial dimensions.
Decoder
The expanding path. Uses up-convolutions to restore spatial resolution and create the final segmentation mask.
Skip Connections
Direct links between the Encoder and Decoder that bypass the bottleneck, carrying high-resolution spatial information.
Dice Coefficient
A spatial overlap metric used to evaluate segmentation performance. Formula: Dice = 2|A ∩ B| / (|A| + |B|), i.e., twice the intersection divided by the total number of elements in both masks.
1x1 Convolution
Often used at the final layer of U-Net to map the multi-channel feature maps to the desired number of classes.
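As a closing illustration of that last glossary entry, the final 1x1 convolution is a one-liner in PyTorch (64 input channels follow the original U-Net; 3 classes is an arbitrary assumption for the example):

```python
import torch
import torch.nn as nn

# Final U-Net layer: a 1x1 convolution maps the 64 feature channels to
# one logit per class at every pixel, without touching spatial size.
head = nn.Conv2d(64, 3, kernel_size=1)

features = torch.randn(1, 64, 128, 128)
logits = head(features)
print(logits.shape)  # torch.Size([1, 3, 128, 128])
```

Applying an argmax over the class dimension of `logits` then yields the per-pixel label mask described at the top of this article.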