YOLO-World: Real-Time Open-Vocabulary Object Detection


22 Feb 2024 | Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, Ying Shan
YOLO-World extends the YOLO family of detectors with open-vocabulary detection capabilities through vision-language modeling and pre-training on large-scale datasets. The authors propose a Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) and a region-text contrastive loss to facilitate the interaction between visual and linguistic information. YOLO-World detects a wide range of objects in a zero-shot manner with high efficiency, achieving 35.4 AP at 52.0 FPS on the LVIS dataset and outperforming many state-of-the-art methods in both accuracy and speed.

The pre-trained YOLO-World adapts easily to downstream tasks such as open-vocabulary instance segmentation and referring object detection. The paper also introduces a *prompt-then-detect* paradigm, in which user prompts are encoded into an offline vocabulary ahead of time, improving the efficiency of open-vocabulary object detection in real-world scenarios. The main contributions are YOLO-World itself, the proposed RepVL-PAN, and an effective pre-training scheme. The experimental results demonstrate superior speed and open-vocabulary performance, highlighting the benefits of vision-language pre-training on small models.
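To make the region-text matching idea concrete, here is a minimal sketch of the *prompt-then-detect* flow: the prompt vocabulary is encoded once offline, and at inference each region embedding is classified by cosine similarity against that vocabulary. All shapes, the temperature value, and the random stand-in embeddings are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical sizes: D-dim embeddings, C vocabulary prompts, B regions.
rng = np.random.default_rng(0)
D, C, B = 8, 3, 5

# Offline step ("prompt-then-detect"): encode the user's prompt vocabulary
# once, e.g. with a text encoder; random vectors stand in here.
text_emb = rng.normal(size=(C, D))
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)

# Online step: the detector produces one embedding per candidate region.
region_emb = rng.normal(size=(B, D))
region_emb /= np.linalg.norm(region_emb, axis=1, keepdims=True)

# Region-text similarity: cosine similarity scaled by a temperature,
# used as classification logits over the open vocabulary.
tau = 0.07  # illustrative temperature value
logits = region_emb @ text_emb.T / tau  # shape (B, C)
pred_class = logits.argmax(axis=1)     # one vocabulary index per region
```

Because the text embeddings are fixed after the offline step, the per-image cost is just this matrix product, which is what lets the detector keep real-time speed regardless of how the vocabulary was specified.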