Object Detection is the combination of image classification and localization. It enables machines to identify multiple objects in a scene and pinpoint their exact coordinates.
1. Localizing Features
Image classification tells us WHAT is in an image. Object Detection takes it a massive step further, telling us WHAT and exactly WHERE. Welcome to the world of spatial localization.
While simple classification outputs a single label, Object Detection outputs a geometric 'Bounding Box' for every instance it finds. This defines a rectangular boundary around the target using coordinates like [x, y, width, height], where (x, y) is a reference point on the box, commonly its top-left corner or its center depending on the format.
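To make the [x, y, width, height] format concrete, here is a minimal sketch. The `BoundingBox` class and its method names are hypothetical, and it assumes (x, y) is the top-left corner with the origin at the top-left of the image; some formats use the box center instead.

```python
from dataclasses import dataclass


@dataclass
class BoundingBox:
    """A box in [x, y, width, height] form (assumption: x, y = top-left corner)."""
    x: float
    y: float
    width: float
    height: float

    def corners(self):
        """Return the same box as (x1, y1, x2, y2) corner coordinates."""
        return (self.x, self.y, self.x + self.width, self.y + self.height)

    def area(self):
        """Area of the box in square pixels."""
        return self.width * self.height


box = BoundingBox(x=50, y=30, width=200, height=100)
print(box.corners())  # (50, 30, 250, 130)
print(box.area())     # 20000
```

The corner form (x1, y1, x2, y2) is convenient whenever two boxes need to be compared, because overlaps can be computed directly from the edges.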
# Object Detection Output
# Format: [x, y, width, height]
# Bounding boxes define the absolute limits of an object.
2. You Only Look Once (YOLO)
Early detection algorithms relied on slow 'Sliding Windows' that classified many crops of the image one at a time. Modern 'Single Shot' detectors like YOLO (You Only Look Once) process the entire image in a single forward pass, which makes them fast enough for real-time use.
When you run an image through YOLO, the network divides the image into a grid. Each cell in that grid is responsible for predicting bounding boxes and a 'Confidence Score' for objects whose center falls inside that cell.
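The grid assignment can be sketched in a few lines. This is an illustration only, assuming the common formulation in which the cell containing a box's center point is the responsible one. The function name `responsible_cell` and the 7x7 default grid are assumptions for the example; real YOLO versions differ in grid size and may predict at several grid resolutions.

```python
def responsible_cell(box_xywh, image_size, grid_size=7):
    """Return the (row, col) of the grid cell that contains the box center.

    box_xywh   -- [x, y, width, height] with (x, y) as the top-left corner
    image_size -- (image_width, image_height) in pixels
    grid_size  -- number of cells per side (7 is an illustrative default)
    """
    x, y, w, h = box_xywh
    img_w, img_h = image_size

    # Center point of the box in pixel coordinates
    cx = x + w / 2
    cy = y + h / 2

    # Scale to grid units and clamp so a center on the far edge stays in range
    col = min(int(cx / img_w * grid_size), grid_size - 1)
    row = min(int(cy / img_h * grid_size), grid_size - 1)
    return row, col


print(responsible_cell([50, 30, 200, 100], image_size=(640, 480)))  # (1, 1)
```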
import torch
# Load YOLOv5 model architecture from TorchHub
# 'yolov5s' is the Small, fast version for real-time video
model = torch.hub.load('ultralytics/yolov5', 'yolov5s')
results = model('street_scene.jpg')
3. Intersection over Union (IoU)
To measure how closely a predicted bounding box aligns with the actual object, we use a metric called Intersection over Union (IoU).
It divides the overlapping area (Intersection) by the total combined area (Union). A perfect IoU score is 1.0, meaning the predicted box and the ground truth are identical. This metric is crucial for training and evaluating object detection models.
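As a worked illustration of the formula, the sketch below computes IoU for two axis-aligned boxes. It assumes boxes are given as (x1, y1, x2, y2) corners; a box in [x, y, width, height] form can be converted by adding the width and height to the top-left corner. The function name `iou` is chosen for this example.

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlapping rectangle (zero if the boxes are disjoint)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    # Union = both areas added together, minus the overlap counted twice
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


print(iou((0, 0, 10, 10), (0, 0, 10, 10)))    # 1.0 (identical boxes)
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))    # about 0.333 (half of each box overlaps)
print(iou((0, 0, 10, 10), (20, 20, 30, 30)))  # 0.0 (no overlap)
```

Note that two boxes sharing half their area score only about 0.33, not 0.5, because the union grows as the boxes move apart.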
# IoU Calculation Concept
# Intersection: Area where predicted box overlaps true box
# Union: Total area covered by both boxes combined
# IoU = Area of Overlap / Area of Union
4. Non-Maximum Suppression (NMS)
Because the grid outputs hundreds of predictions, YOLO often produces many overlapping boxes around the same object. We use an algorithm called Non-Maximum Suppression (NMS) to clean up these duplicates.
NMS sorts all predictions by Confidence Score, keeps the box with the highest confidence, and discards any remaining boxes whose IoU with it exceeds a chosen threshold. It then repeats the process on the boxes that are left. This removes duplicate, overlapping bounding boxes so that each object ends up with a single box.
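The procedure described above can be sketched in plain Python. This is a simplified, single-class version for illustration: the function names, the (x1, y1, x2, y2, confidence) tuple layout, and the 0.5 default threshold are assumptions for this example, and production implementations usually run NMS per class on tensors.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """Greedy NMS over detections shaped (x1, y1, x2, y2, confidence)."""
    # 1. Sort by confidence, highest first
    remaining = sorted(detections, key=lambda d: d[4], reverse=True)
    kept = []
    while remaining:
        # 2. Keep the most confident box that is still in the pool
        best = remaining.pop(0)
        kept.append(best)
        # 3. Drop every other box that overlaps it too strongly
        remaining = [d for d in remaining if iou(best[:4], d[:4]) < iou_threshold]
    return kept


detections = [
    (100, 100, 200, 200, 0.90),  # strong detection of object A
    (105, 102, 205, 198, 0.75),  # near-duplicate of the box above
    (300, 300, 400, 400, 0.80),  # separate object B
]
kept = non_max_suppression(detections)
print(kept)  # the 0.90 and 0.80 boxes survive; the 0.75 duplicate is removed
```

A lower `iou_threshold` suppresses more aggressively, which can merge genuinely distinct objects that stand close together; a higher one lets more duplicates through.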
# Non-Maximum Suppression (NMS) Workflow
# 1. Sort all predictions by Confidence Score
# 2. Keep the box with the highest confidence
# 3. Discard any nearby boxes that have a high IoU with it
# 4. Repeat with the remaining boxes
5. Confidence Thresholding
With a well-chosen IoU threshold and NMS filtering, we get a much cleaner set of bounding boxes. Pipelines of this kind are used in real-time perception systems, such as collision detection in self-driving cars.
Every detection also carries a 'Confidence Score' between 0.0 and 1.0. If we are building a safety application, we might ignore any bounding box with less than 0.85 confidence to reduce false alarms (false positives).
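# Confidence Thresholding
# Only trust detections above 85% certainty
# Each row of results.pred[0] is [x1, y1, x2, y2, confidence, class]
for detection in results.pred[0]:
    confidence = detection[4]
    if confidence < 0.85:
        continue  # Ignore weak predictions
    # Report the corner coordinates of each trusted detection
    print(detection[:4].tolist(), float(confidence))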