Object Detection & YOLO Architecture
Object detection goes beyond simply telling you what is in an image; it also tells you where each object is by drawing a bounding box around it.
You Only Look Once (YOLO)
Traditional models like R-CNN use two stages: first finding regions of interest, then classifying them. YOLO processes the entire image in a single neural network pass, making it incredibly fast and capable of real-time detection.
Intersection over Union (IoU)
To know if a predicted bounding box is correct, we measure its overlap with the true bounding box. IoU divides the area of overlap by the area of union, the total area covered by both boxes counted once. An IoU above 0.5 is typically considered a good prediction.
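The ratio above is straightforward to compute directly. A minimal sketch, assuming boxes are given in corner format `(x1, y1, x2, y2)` (the function name and format are illustrative, not from a specific library):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    # Corners of the intersection rectangle
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Clamp to zero: non-overlapping boxes have no intersection area
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])

    # Union counts the overlap only once
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Note that identical boxes give an IoU of 1.0 and disjoint boxes give 0.0, so the metric is naturally bounded between 0 and 1.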
Non-Maximum Suppression (NMS)
YOLO might detect the same object multiple times from different grid cells. NMS cleans this up by keeping the box with the highest confidence and removing any heavily overlapping boxes.
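The greedy keep-highest-then-suppress loop described above can be sketched as follows; the 0.5 suppression threshold and corner box format are illustrative assumptions, and a small IoU helper is included so the sketch is self-contained:

```python
def iou(box_a, box_b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of boxes kept after non-maximum suppression."""
    # Consider boxes in order of decreasing confidence
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)       # highest-confidence remaining box survives
        keep.append(best)
        # Drop every remaining box that heavily overlaps the survivor
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```

For example, two near-duplicate detections of the same object collapse to one surviving box, while a far-away detection is untouched.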
In YOLO, the image is divided into an S x S grid. Each grid cell predicts B bounding boxes, each described by five values (x, y, width, height, and a confidence score), plus C class probabilities, so the final output tensor has size S x S x (B * 5 + C). This unified architecture lets YOLO reason about the global context of the image, reducing background errors compared to sliding-window approaches.
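The tensor size formula is easy to verify with the settings from the original YOLO paper, which used S=7, B=2, and C=20 on Pascal VOC (the function name here is purely illustrative):

```python
def yolo_output_shape(S, B, C):
    """Output tensor shape for a YOLO-style grid detector.

    Each of the S*S cells predicts B boxes (x, y, w, h, confidence
    = 5 values each) plus C class probabilities shared by the cell.
    """
    return (S, S, B * 5 + C)

# Original YOLO settings on Pascal VOC: a 7x7 grid, 2 boxes, 20 classes
print(yolo_output_shape(7, 2, 20))  # (7, 7, 30)
```

So the network's final layer emits 7 x 7 x 30 = 1470 numbers for the whole image in one forward pass.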
