YOLO-World: Real-Time Open-Vocabulary Object Detection


22 Feb 2024 | Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, Ying Shan
YOLO-World is a real-time open-vocabulary object detector that augments the YOLO architecture with vision-language modeling and pre-training on large-scale datasets. It introduces a Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) and a region-text contrastive loss to strengthen the interaction between visual and linguistic features. The model combines high efficiency with strong zero-shot performance, reaching 35.4 AP on the LVIS dataset at 52.0 FPS on a V100 GPU and outperforming many state-of-the-art methods in both accuracy and speed. It also transfers well to downstream tasks such as open-vocabulary instance segmentation and referring object detection.
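The region-text contrastive loss aligns embeddings of detected regions with embeddings of the category texts they should match. The following is a minimal, hypothetical sketch of that idea (not the paper's actual implementation): the temperature `tau`, the embedding dimension, and the `labels` pairing of regions to texts are all illustrative assumptions.

```python
import numpy as np

def l2_normalize(x):
    """Normalize rows to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def region_text_contrastive_loss(region_emb, text_emb, labels, tau=0.05):
    """Toy region-text contrastive loss (simplified sketch, not YOLO-World's
    exact formulation).

    region_emb: (R, D) L2-normalized region embeddings
    text_emb:   (T, D) L2-normalized text (vocabulary) embeddings
    labels:     (R,) index of the matching text for each region
    tau:        temperature scaling (assumed value, for illustration)
    """
    # Cosine-similarity logits between every region and every text
    logits = region_emb @ text_emb.T / tau              # (R, T)
    # Softmax cross-entropy: each region should score highest on its paired text
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# Random embeddings stand in for real region/text features
rng = np.random.default_rng(0)
regions = l2_normalize(rng.normal(size=(4, 16)))
texts = l2_normalize(rng.normal(size=(3, 16)))
labels = np.array([0, 1, 2, 0])
loss = region_text_contrastive_loss(regions, texts, labels)
print(round(float(loss), 3))
```

With random embeddings the loss is positive; as region embeddings move toward their paired text embeddings, it approaches zero, which is the alignment the pre-training objective encourages.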
YOLO-World is pre-trained on large-scale detection, grounding, and image-text datasets, enabling it to detect a broad range of object categories in a zero-shot manner. The model is efficient, easy to deploy, and adaptable to various downstream tasks. It adopts a prompt-then-detect paradigm: the user's prompts are encoded into an offline vocabulary ahead of time, improving inference efficiency in real-world scenarios. The pre-trained weights and code are open-sourced, making YOLO-World a practical and promising solution for real-world open-vocabulary detection.
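The efficiency gain of prompt-then-detect comes from encoding the vocabulary once, offline, and reusing the cached text embeddings for every image instead of running a text encoder per inference. A minimal sketch of that caching pattern follows; the `encode_prompt` function is a hypothetical stand-in for a real text encoder (YOLO-World uses a frozen CLIP text encoder), and the class name is illustrative.

```python
import numpy as np

def encode_prompt(prompt, dim=16):
    """Hypothetical stand-in for a CLIP-style text encoder: maps a prompt
    string to a deterministic unit-length embedding."""
    rng = np.random.default_rng(sum(prompt.encode()))  # deterministic seed
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

class PromptThenDetect:
    """Encode the user's vocabulary once, offline; inference then uses only
    the cached embeddings, so no text encoder runs per image."""

    def __init__(self, vocabulary):
        self.vocabulary = list(vocabulary)
        # Offline step: text embeddings computed once, reused for every image
        self.text_emb = np.stack([encode_prompt(p) for p in self.vocabulary])

    def detect(self, region_emb):
        # Classify each region embedding against the cached vocabulary
        scores = region_emb @ self.text_emb.T
        return [self.vocabulary[i] for i in scores.argmax(axis=1)]

detector = PromptThenDetect(["person", "dog", "skateboard"])
# Toy "region embeddings" that happen to match two vocabulary entries
regions = np.stack([encode_prompt("dog"), encode_prompt("person")])
print(detector.detect(regions))  # -> ['dog', 'person']
```

In the real model the offline vocabulary can additionally be re-parameterized into the detector's weights, so deployment needs neither the text encoder nor a separate matching step.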