Object Detection: YOLO & SSD
While Image Classification answers "What is in this image?", Object Detection answers "What is it, and exactly where is it?" by drawing bounding boxes around targets.
YOLO (You Only Look Once)
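Traditional systems repurposed classifiers to perform detection by running a "sliding window" over the image at multiple scales and classifying the contents of each window. This approach could be accurate, but running the classifier on every window at every scale made it notoriously slow.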
YOLO fundamentally changed this. It frames object detection as a single regression problem: the image is passed through a convolutional neural network exactly once. The network divides the image into an S x S grid. The grid cell that contains an object's center is responsible for detecting that object, and each grid cell predicts bounding boxes, confidence scores, and class probabilities simultaneously.
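To make the grid idea concrete, here is a minimal sketch of how an object's center maps to a grid cell and how large the prediction tensor is. It assumes a YOLOv1-style layout in which each cell predicts B boxes of (x, y, w, h, confidence) plus C class probabilities. The function names and the defaults S=7, B=2, C=20 are illustrative assumptions, not something stated in the text above.

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return (row, col) of the grid cell that contains the box center.

    cx, cy are the center coordinates in pixels. The min() clamp keeps a
    center lying exactly on the right/bottom image edge inside the grid.
    """
    col = min(int(cx / img_w * S), S - 1)
    row = min(int(cy / img_h * S), S - 1)
    return row, col


def output_shape(S=7, B=2, C=20):
    """Shape of the prediction tensor under the assumed layout:
    each cell holds B boxes * 5 values + C class probabilities."""
    return (S, S, B * 5 + C)


# An object centered at (224, 100) in a 448x448 image
print(responsible_cell(224, 100, 448, 448))  # (1, 3)
print(output_shape())                        # (7, 7, 30)
```

Everything in that tensor comes out of one forward pass, which is why no per-window classification is needed.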
SSD (Single Shot Detector)
SSD also detects objects in a single pass, but handles scales differently. Instead of relying on one feature map layer, SSD adds auxiliary structure to the network to produce predictions from multiple feature maps at different resolutions.
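As a rough illustration of what predicting from multiple feature maps means in practice, the sketch below counts how many candidate boxes each resolution contributes. The feature map sizes and boxes-per-location values are assumptions taken from the commonly described 300x300 SSD configuration. Other variants use different numbers.

```python
# (feature map side length, boxes predicted per location).
# Assumed values for a 300x300-input SSD; treat them as illustrative.
FEATURE_MAPS = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]


def boxes_per_layer(feature_maps):
    """Number of candidate boxes each feature map contributes."""
    return [side * side * k for side, k in feature_maps]


per_layer = boxes_per_layer(FEATURE_MAPS)
total = sum(per_layer)

print(per_layer)  # [5776, 2166, 600, 150, 36, 4]
print(total)      # 8732
```

Under these assumptions the finest (38x38) map supplies roughly two thirds of all candidate boxes in SSD, while the coarse maps supply a handful of boxes intended for large objects.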
SSD uses anchor boxes (also called default or prior boxes) of different scales and aspect ratios at each feature map location. By predicting offsets relative to these anchor boxes rather than absolute coordinates, SSD maintains high speed, while the higher-resolution feature maps improve accuracy on smaller objects.
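The idea of predicting adjustments to an anchor rather than raw coordinates can be sketched as a small decoding step. This assumes the common center-size parameterization (center shifts scaled by the anchor size, log-scale width and height). The function name is hypothetical, and the variance scaling that many implementations apply is left out for brevity.

```python
import math


def decode(anchor, offsets):
    """Turn predicted offsets into a box, relative to one anchor.

    anchor:  (cx, cy, w, h) of the anchor box
    offsets: (t_cx, t_cy, t_w, t_h) as predicted by the network
    Assumed parameterization; real implementations may also scale the
    offsets by fixed "variance" constants.
    """
    a_cx, a_cy, a_w, a_h = anchor
    t_cx, t_cy, t_w, t_h = offsets
    cx = a_cx + t_cx * a_w    # shift center by a fraction of anchor width
    cy = a_cy + t_cy * a_h    # shift center by a fraction of anchor height
    w = a_w * math.exp(t_w)   # scale width multiplicatively
    h = a_h * math.exp(t_h)   # scale height multiplicatively
    return cx, cy, w, h


anchor = (0.5, 0.5, 0.2, 0.4)
# All-zero offsets reproduce the anchor itself
print(decode(anchor, (0.0, 0.0, 0.0, 0.0)))  # (0.5, 0.5, 0.2, 0.4)
```

Because an all-zero prediction already yields a plausible box, the network only has to learn small corrections to each anchor.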
❓ Core Detection Concepts
What is Intersection over Union (IoU)?
Intersection over Union (IoU) measures how well a predicted bounding box matches a ground-truth bounding box. It is the area of overlap between the two boxes divided by the area of their union, so it ranges from 0 (no overlap) to 1 (identical boxes). An IoU of at least 0.5 is a common threshold for counting a prediction as correct, for example in PASCAL VOC-style evaluation, although stricter benchmarks also use higher thresholds.
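The definition translates almost directly into code. The following is a minimal sketch that assumes axis-aligned boxes given as (x1, y1, x2, y2) corner coordinates. The box format and the function name are choices made for this example.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    # Corners of the intersection rectangle
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Clamp at zero so disjoint boxes get an empty intersection
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


print(iou((0, 0, 10, 10), (0, 0, 10, 5)))    # 0.5
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175, about 0.143
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))       # 0.0
```

The second example shows how quickly IoU drops: two equal-sized boxes offset by half their width and height share only a quarter of each box's area, which gives an IoU of about 0.14.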
How does Non-Maximum Suppression (NMS) work?
Because algorithms like YOLO and SSD predict multiple overlapping bounding boxes for the same object, Non-Maximum Suppression (NMS) is applied to clean up the output. NMS works by:
- Selecting the bounding box with the highest confidence score.
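- Removing all other bounding boxes whose IoU with the selected box exceeds a chosen threshold, since they most likely cover the same object.
- Repeating the process on the remaining boxes until none are left, so that each object ends up with a single box.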
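The steps above can be sketched as a greedy loop. This is a minimal, unoptimized version that assumes boxes in (x1, y1, x2, y2) format and a hypothetical default threshold of 0.5. In practice NMS is usually run separately per class, and libraries provide vectorized implementations.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS. Returns indices of kept boxes, highest score first."""
    # Candidate indices sorted by descending confidence
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)   # step 1: take the most confident box
        keep.append(best)
        # step 2: drop candidates that overlap it too strongly
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) <= iou_threshold]
        # step 3: the loop repeats on whatever is left
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

Here the second box overlaps the first with an IoU of about 0.68, so it is suppressed, while the third box is far away and survives as a separate detection.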