In object detection, anchor boxes are a set of predefined bounding boxes, each defined by a scale (its size) and an aspect ratio (its width-to-height ratio). Copies of these boxes are tiled across the image at several scales and aspect ratios, typically centered at every position of a convolutional feature map, where they serve as reference frames. Instead of predicting an object's coordinates from scratch, the model predicts offsets from a matching anchor. These offsets are adjustments to the anchor's center position (x, y) and its size (width, height) that make it fit the target object. This turns the hard problem of regressing absolute coordinates into the easier one of predicting small corrections. Because many anchors of different shapes cover every location, the model can also detect multiple objects of varying shapes and sizes in a single forward pass.
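The tiling and offset steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the Faster R-CNN reference code. The function names (`generate_anchors`, `encode`, `decode`), the center-format `Box` type, and the convention that a ratio means width divided by height, with every anchor keeping an area of scale² regardless of its ratio, are all hypothetical choices made for clarity. The offset formulas follow the parameterization commonly used since Faster R-CNN: center shifts are normalized by the anchor's width and height, and size changes are expressed in log space.

```python
import math
from typing import List, NamedTuple, Tuple


class Box(NamedTuple):
    """Axis-aligned box in center format (cx, cy, w, h), in pixels (assumed format)."""
    cx: float
    cy: float
    w: float
    h: float


def generate_anchors(feat_h: int, feat_w: int, stride: int,
                     scales: List[float], ratios: List[float]) -> List[Box]:
    """Place one anchor per (scale, ratio) pair at the center of every feature-map cell.

    Assumed convention: ratio r = width / height, and each anchor has area scale**2.
    """
    anchors = []
    for i in range(feat_h):
        for j in range(feat_w):
            # Center of feature-map cell (i, j), mapped back to image coordinates.
            cx = (j + 0.5) * stride
            cy = (i + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w = s * math.sqrt(r)  # w / h == r and w * h == s**2
                    h = s / math.sqrt(r)
                    anchors.append(Box(cx, cy, w, h))
    return anchors


def encode(gt: Box, anchor: Box) -> Tuple[float, float, float, float]:
    """Offsets (tx, ty, tw, th) that turn `anchor` into the ground-truth box `gt`."""
    tx = (gt.cx - anchor.cx) / anchor.w  # center shift, relative to anchor width
    ty = (gt.cy - anchor.cy) / anchor.h  # center shift, relative to anchor height
    tw = math.log(gt.w / anchor.w)       # size change in log space
    th = math.log(gt.h / anchor.h)
    return tx, ty, tw, th


def decode(offsets: Tuple[float, float, float, float], anchor: Box) -> Box:
    """Apply predicted offsets to an anchor; the inverse of `encode`."""
    tx, ty, tw, th = offsets
    return Box(
        cx=anchor.cx + tx * anchor.w,
        cy=anchor.cy + ty * anchor.h,
        w=anchor.w * math.exp(tw),
        h=anchor.h * math.exp(th),
    )


if __name__ == "__main__":
    # Hypothetical 2x2 feature map with stride 16, two scales and three ratios.
    anchors = generate_anchors(2, 2, 16, scales=[32, 64], ratios=[0.5, 1.0, 2.0])
    print(len(anchors))  # 24 = 2 * 2 cells * 2 scales * 3 ratios

    anchor = anchors[0]                    # centered at (8, 8), scale 32, ratio 0.5
    gt = Box(cx=12.0, cy=10.0, w=30.0, h=40.0)
    offsets = encode(gt, anchor)           # the regression target for this anchor
    print(offsets)
    print(decode(offsets, anchor))         # recovers gt up to floating-point rounding
```

Normalizing the center shift by the anchor's size keeps offsets comparable between small and large anchors, and the log-space size terms guarantee that decoded widths and heights stay positive. In a real detector, the offsets would come from a network head rather than from `encode`, which is only used to build training targets.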
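Anchor boxes were introduced by Ren et al. in the 2015 Faster R-CNN paper as part of its Region Proposal Network. They provided an efficient, learnable way to generate region proposals, a step that earlier detectors such as Fast R-CNN delegated to slow external algorithms like Selective Search and that had become a major computational bottleneck.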
Anchor boxes became a foundational concept in many modern object detectors, including two-stage detectors (Faster R-CNN and its successors, such as Mask R-CNN) and one-stage detectors (such as SSD and YOLO from YOLOv2 onward). Although their use is widespread, a newer class of "anchor-free" detectors (such as FCOS and CornerNet) has emerged. These predict object locations without predefined boxes, for example by regressing distances from each feature-map location to the box edges (FCOS) or by detecting box corners as keypoints (CornerNet).