Object Detection: YOLO & SSD

While Image Classification answers "What is in this image?", Object Detection answers "What is it, and exactly where is it?" by drawing bounding boxes around targets.

YOLO (You Only Look Once)

Traditional systems repurposed classifiers to perform detection by running a "sliding window" over the image at multiple scales. This approach could be accurate but was notoriously slow, because the classifier had to be evaluated separately for every window position and scale.

YOLO fundamentally changed this. It frames object detection as a single regression problem. The image is passed through a convolutional neural network once. The network divides the image into an S x S grid, and each grid cell predicts bounding boxes, confidence scores, and class probabilities simultaneously.
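To make the grid idea concrete, here is a minimal sketch of how an object's center maps to a grid cell and how large the network output is. It assumes the layout of the original YOLO formulation, where each cell predicts B boxes with five values each (x, y, w, h, confidence) plus C class probabilities. The function names and the defaults S=7, B=2, C=20 are illustrative choices, not part of any library.

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return the (row, col) of the grid cell that contains the box center.

    cx, cy are the center coordinates in pixels. The min() keeps a center
    lying exactly on the right or bottom image edge inside the grid.
    """
    col = min(int(cx / img_w * S), S - 1)
    row = min(int(cy / img_h * S), S - 1)
    return row, col


def output_shape(S=7, B=2, C=20):
    """Shape of the prediction tensor, assuming each cell predicts
    B boxes of (x, y, w, h, confidence) and C class probabilities."""
    return (S, S, B * 5 + C)


# An object centered at (224, 100) in a 448 x 448 image
print(responsible_cell(224, 100, 448, 448))  # (1, 3)
print(output_shape())                        # (7, 7, 30)
```

Because every cell's predictions come out of the same forward pass, the cost of detection does not grow with the number of candidate windows, which is what separates this from the sliding-window approach.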

SSD (Single Shot Detector)

SSD also detects objects in a single pass, but handles scales differently. Instead of relying on one feature map layer, SSD adds auxiliary structure to the network to produce predictions from multiple feature maps at different resolutions.

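SSD uses Anchor Boxes (also called default or prior boxes) of different scales and aspect ratios at each feature map location. Rather than regressing absolute coordinates, the network predicts adjustments (offsets) to these anchor boxes. Combined with the multi-resolution feature maps, this keeps SSD fast while improving accuracy across object sizes: higher-resolution maps cover smaller objects, and lower-resolution maps cover larger ones.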
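The anchor mechanism can be sketched in a few lines. The example below builds an anchor from a scale and an aspect ratio, then applies predicted offsets to it. It assumes the common center-size parameterization (center shifts are scaled by the anchor size, width and height use an exponential) and leaves out the variance scaling factors that real SSD implementations usually apply. The names `make_anchor` and `decode` are hypothetical helpers for illustration.

```python
import math


def make_anchor(cx, cy, scale, aspect_ratio):
    """Build an anchor (cx, cy, w, h) in normalized image coordinates.

    The width grows and the height shrinks with sqrt(aspect_ratio),
    so the anchor area stays scale * scale for every ratio.
    """
    w = scale * math.sqrt(aspect_ratio)
    h = scale / math.sqrt(aspect_ratio)
    return (cx, cy, w, h)


def decode(anchor, offsets):
    """Apply predicted offsets (tx, ty, tw, th) to an anchor.

    Assumed parameterization (simplified, no variance terms):
      center moves by a fraction of the anchor size,
      width and height are multiplied by exp(offset).
    """
    a_cx, a_cy, a_w, a_h = anchor
    tx, ty, tw, th = offsets
    cx = a_cx + tx * a_w
    cy = a_cy + ty * a_h
    w = a_w * math.exp(tw)
    h = a_h * math.exp(th)
    return (cx, cy, w, h)


# A wide anchor (aspect ratio 4) at the image center
anchor = make_anchor(0.5, 0.5, scale=0.2, aspect_ratio=4.0)

# Zero offsets return the anchor unchanged; non-zero offsets nudge it
print(decode(anchor, (0.0, 0.0, 0.0, 0.0)))
print(decode(anchor, (0.5, -0.25, 0.1, 0.0)))
```

Predicting small corrections relative to a nearby anchor is an easier regression target than predicting raw coordinates, since the targets stay in a narrow range regardless of where the object sits in the image.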

❓ Core Detection Concepts

What is Intersection over Union (IoU)?

Intersection over Union (IoU) measures how well a predicted bounding box matches a ground-truth bounding box. It is the area of overlap between the two boxes divided by the area of their union, so it ranges from 0 (no overlap) to 1 (identical boxes). A detection is commonly counted as correct when its IoU with the ground truth is at least 0.5, the threshold used by the PASCAL VOC benchmark; stricter evaluations use higher thresholds.
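In symbols, IoU(A, B) = area(A ∩ B) / area(A ∪ B). A minimal sketch of the calculation follows. It assumes boxes are given as [x, y, width, height] with (x, y) as the top-left corner; other conventions (center-based, or corner pairs) need a conversion first.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x, y, width, height),
    where (x, y) is assumed to be the top-left corner."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b

    # Width and height of the overlap rectangle, clamped at 0 when disjoint
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h

    # Union = sum of the areas minus the part counted twice
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


# Two 10 x 10 boxes offset by 5 pixels: overlap 25, union 175
print(round(iou((0, 0, 10, 10), (5, 5, 10, 10)), 3))  # 0.143
```

Note how quickly the score drops: a box shifted by half its size in both directions scores only about 0.14, well below the usual 0.5 threshold.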

How does Non-Maximum Suppression (NMS) work?

Because algorithms like YOLO and SSD predict multiple overlapping bounding boxes for the same object, Non-Maximum Suppression (NMS) is applied to clean up the output. NMS works by:

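1. Selecting the bounding box with the highest confidence score and keeping it.
2. Removing all other bounding boxes whose IoU with the selected box exceeds a chosen threshold (0.5 is a typical value).
3. Repeating the process on the remaining boxes until none are left, so that each object ends up with a single box.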
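The steps above can be sketched as a short greedy loop. This is a simplified, single-class version in plain Python; it assumes boxes in (x, y, width, height) form with a top-left origin, and the default threshold of 0.5 is an illustrative choice. Production code typically runs NMS per class and uses a vectorized library routine instead.

```python
def iou(box_a, box_b):
    """IoU of two (x, y, width, height) boxes with a top-left origin."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy Non-Maximum Suppression. Returns the indices of kept boxes,
    ordered from highest to lowest confidence."""
    # Candidate indices sorted by confidence, best first
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # step 1: take the most confident box
        keep.append(best)
        # step 2: drop every remaining box that overlaps it too much
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) <= iou_threshold]
        # step 3: loop again on whatever is left
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (50, 50, 10, 10)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

In the usage example, the second box overlaps the first with an IoU of about 0.68, so it is suppressed, while the third box is far away and survives as a separate object.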

Vision Glossary

Bounding Box

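A rectangle that tightly encloses an object within an image, commonly defined by [x, y, width, height], where (x, y) is either the top-left corner or the center of the box depending on the convention in use.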

mAP (Mean Average Precision)

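The standard metric used to evaluate Object Detection models. Average Precision (AP) summarizes the precision-recall curve for one class at a given IoU threshold; mAP is the mean of the AP values across all classes.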

Anchor Box

A pre-defined box of a specific height and width, used as a reference from which the actual object box is predicted.

Confidence Score

A probability value (0 to 1) indicating how certain the model is that a bounding box contains an object.
