Recognizing a face is one thing; locating it in a crowded street is another. Object detection is the AI's ability to perceive the geometry of the world.
1. Beyond Classification
Standard image classification is excellent at answering one question: "What is in this image?" However, when a self-driving car looks at a busy intersection, just knowing "there is a pedestrian" isn't enough. It needs to know *exactly where* that pedestrian is.
Object Detection solves this by finding the coordinates of the object. It draws a Bounding Box around the item, defined by its X and Y center coordinates, its width, and its height. This dual task, identifying the class (Classification) and finding the coordinates (Localization), is what gives AI true spatial awareness.
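The center-based box format described above can be sketched in a few lines. This is a minimal illustration, not any library's API: the `Box` type, the `to_corners` helper, and the sample values are all hypothetical. Converting to corner form is useful because overlap calculations are usually easier on corners.

```python
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    """A bounding box in center format (hypothetical helper type)."""
    x: float  # X coordinate of the box center
    y: float  # Y coordinate of the box center
    w: float  # box width
    h: float  # box height


def to_corners(box: Box) -> Tuple[float, float, float, float]:
    """Convert center format to (x_min, y_min, x_max, y_max)."""
    half_w, half_h = box.w / 2, box.h / 2
    return (box.x - half_w, box.y - half_h, box.x + half_w, box.y + half_h)


# Assumed example: a box centered at (220, 135), 200 px wide and 180 px tall.
dog = Box(x=220, y=135, w=200, h=180)
corners = to_corners(dog)
print(corners)
```

Both representations carry the same information, so which one a given model or dataset uses is a convention worth checking before comparing boxes.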
# Classification vs Detection
# Classification output: "Dog" (99%)
# Detection output:
# "Dog" at [X: 120, Y: 45, W: 200, H: 180]
# "Cat" at [X: 400, Y: 90, W: 150, H: 120]
2. The YOLO Revolution
In the early days of deep-learning-based detection, the process was incredibly slow. Algorithms like R-CNN first proposed roughly 2,000 candidate regions per image, then ran a neural network on each cropped region one by one to see if an object was there.
Then came YOLO (You Only Look Once). YOLO completely reframed the problem. Instead of scanning piece by piece, it passes the entire image through the neural network exactly one time. It treats detection as a single massive math problem (a regression problem), predicting all bounding boxes and class probabilities simultaneously. This made real-time video detection possible.
from ultralytics import YOLO
# Load YOLOv8 Nano (Fastest model)
model = YOLO('yolov8n.pt')
# Detect objects in a single pass
results = model.predict('street_view.jpg')
3. Image Division
How does YOLO look at everything at once? It divides the input image into a grid (e.g., 13 x 13).
Each cell in that grid predicts a fixed number of bounding boxes, but it is responsible for an object *only* if the center of that object falls inside the cell. The cell predicts the box coordinates along with a confidence score (how certain it is that an object exists there). If multiple objects are in the image, different grid cells take responsibility for detecting them in parallel.
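The rule for deciding which cell owns an object can be sketched as follows. This is only an illustration under stated assumptions: the function name `responsible_cell`, the 416 x 416 image size, and the sample center point are hypothetical, and the grid size defaults to the 13 x 13 example above.

```python
from typing import Tuple


def responsible_cell(x_center: float, y_center: float,
                     img_w: int, img_h: int, S: int = 13) -> Tuple[int, int]:
    """Return (row, col) of the S x S grid cell that contains the object's center."""
    # Scale the center into grid units, then truncate to a cell index.
    # min(...) keeps a center lying exactly on the right/bottom edge inside the grid.
    col = min(int(x_center / img_w * S), S - 1)
    row = min(int(y_center / img_h * S), S - 1)
    return row, col


# Assumed example: an object centered at (220, 135) in a 416 x 416 image.
cell = responsible_cell(x_center=220, y_center=135, img_w=416, img_h=416)
print(cell)
```

With a 416-pixel image and 13 cells per side, each cell covers 32 pixels, so a center at (220, 135) lands in column 6, row 4.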
"""
YOLO Grid Logic:
1. Divide image into S x S grid.
2. Is object center in cell (3,4)?
3. If yes, cell (3,4) predicts the box.
"""
4. Intersection over Union (IoU)
When training a detection model, you need a way to grade its homework. If the human drew a box around a car, and the AI drew a slightly different box, how do you score the AI?
We use Intersection over Union (IoU). This metric calculates the area where the two boxes overlap (Intersection) and divides it by the total area covered by both boxes combined (Union). An IoU of 0.0 means no overlap, while 1.0 means a perfect match. Usually, anything above 0.5 is considered a successful detection.
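As a worked example with assumed values, take two 10 x 10 boxes given by their corners, A = (0, 0, 10, 10) and B = (5, 5, 15, 15). They overlap in a 5 x 5 square, and the union is the sum of both areas minus that overlap (so it is not counted twice):

```latex
\mathrm{IoU}(A, B)
  = \frac{|A \cap B|}{|A \cup B|}
  = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}
  = \frac{5 \times 5}{100 + 100 - 25}
  = \frac{25}{175} \approx 0.14
```

At roughly 0.14, this pair falls well below the 0.5 threshold, so the prediction would not count as a hit.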
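def calculate_iou(boxA, boxB):
    # Boxes are assumed to be (x_min, y_min, x_max, y_max) corners.
    inter_w = max(0, min(boxA[2], boxB[2]) - max(boxA[0], boxB[0]))
    inter_h = max(0, min(boxA[3], boxB[3]) - max(boxA[1], boxB[1]))
    inter = inter_w * inter_h  # Area of overlap
    union = (boxA[2] - boxA[0]) * (boxA[3] - boxA[1]) + (boxB[2] - boxB[0]) * (boxB[3] - boxB[1]) - inter
    return inter / union if union > 0 else 0.0  # Target: > 0.5 for a 'hit'
5. Non-Maximum Suppression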
Because many grid cells make predictions independently, YOLO often produces duplicates. If there is a dog in the image, YOLO might draw five slightly different bounding boxes around the exact same dog because several neighboring grid cells all thought they detected it.
To clean this up, the pipeline applies Non-Maximum Suppression (NMS) as a post-processing step. NMS looks at all boxes predicted for the same class, keeps the one with the highest confidence score, and deletes (suppresses) every other box whose IoU with it exceeds a chosen threshold. It then repeats the process on the boxes that remain, so the final output ideally has one clean box per object.
# Non-Maximum Suppression (NMS)
# Input: 5 boxes for the same dog
# Output: 1 best box (highest confidence)
# The rest are deleted.
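The suppression procedure can be sketched in plain Python. This is a minimal greedy version for a single class, not the implementation inside any particular YOLO release: the corner box format, the function names, the 0.5 threshold, and the sample detections are all assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two corner-format boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], iou_threshold: float = 0.5) -> List[int]:
    """Return the indices of the boxes to keep, highest confidence first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    while order:
        best = order.pop(0)  # most confident box still in play
        keep.append(best)
        # Suppress every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


# Hypothetical detections: three boxes on the same dog, one box elsewhere.
boxes = [(100, 100, 300, 300), (110, 105, 310, 295),
         (95, 110, 290, 305), (400, 90, 550, 210)]
scores = [0.90, 0.75, 0.60, 0.80]
kept = nms(boxes, scores)
print(kept)
```

The threshold is a trade-off: set it too low and two genuinely different objects standing close together may be merged into one box; set it too high and duplicates survive.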