14 Apr 2024 | Lewei Yao, Renjie Pi, Jianhua Han, Xiaodan Liang, Hang Xu, Wei Zhang, Zhenguo Li, Dan Xu
DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection
DetCLIPv3 is a high-performing open-vocabulary object detector that not only localizes objects but also generates hierarchical labels for them. It rests on three core designs: a versatile model architecture, high-information-density training data, and an efficient training strategy. Architecturally, a robust open-vocabulary detector is augmented with an object captioner that supplies the generative capability: the captioner consumes the detector's foreground proposals and is trained with a language-modeling objective to produce hierarchical labels for each detected object. This pairing yields both accurate localization and detailed descriptions of visual concepts.
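To make the design concrete, here is a minimal sketch (not the authors' code) of such a generative head: a small transformer decoder conditioned on pooled per-proposal features and trained with a next-token language-modeling loss. All module names and sizes (RegionCaptioner, region_dim, vocab_size, etc.) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RegionCaptioner(nn.Module):
    def __init__(self, region_dim=256, d_model=256, vocab_size=1000, max_len=16):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.region_proj = nn.Linear(region_dim, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, region_feats, tokens):
        # region_feats: (num_proposals, region_dim) pooled features of the
        # detector's foreground proposals; tokens: (num_proposals, seq_len)
        # tokenized hierarchical labels, e.g. "terrier, dog, animal".
        seq_len = tokens.size(1)
        pos = torch.arange(seq_len, device=tokens.device)
        x = self.token_emb(tokens) + self.pos_emb(pos)
        memory = self.region_proj(region_feats).unsqueeze(1)  # 1 memory token per region
        causal = nn.Transformer.generate_square_subsequent_mask(seq_len)
        h = self.decoder(x, memory, tgt_mask=causal)
        return self.lm_head(h)  # (num_proposals, seq_len, vocab_size)

# Language-modeling objective: predict each next token of the label sequence.
model = RegionCaptioner()
feats = torch.randn(8, 256)                # 8 foreground proposals
tokens = torch.randint(0, 1000, (8, 16))   # their tokenized hierarchical labels
logits = model(feats, tokens[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))
loss.backward()
```

At inference, the same decoder would be run autoregressively from a start token, emitting a hierarchical label sequence for each detected object without any predefined category list.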
The model relies on an auto-annotation pipeline in which visual large language models refine the captions of large-scale image-text pairs, yielding rich, multi-granular object labels for training. Training then proceeds in two stages: pre-training with low-resolution inputs lets the object captioner efficiently absorb a broad spectrum of visual concepts, and fine-tuning with high-resolution samples further strengthens detection performance. DetCLIPv3 delivers superior open-vocabulary detection, e.g., 47.0 zero-shot fixed AP on the LVIS minival benchmark, outperforming previous methods by significant margins, and a state-of-the-art 19.7 AP for dense captioning on the VG dataset.
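The paper does not publish the pipeline as code; the following is only a hedged sketch of the idea of VLLM-driven caption refinement. The query_vllm helper, its prompt, and the line-oriented output format are hypothetical placeholders for whichever visual LLM is actually used.

```python
from dataclasses import dataclass

@dataclass
class AnnotatedObject:
    phrase: str        # finest-grained label, e.g. "golden retriever"
    hierarchy: list    # coarser parents, e.g. ["dog", "animal"]

def query_vllm(image, prompt):
    # Placeholder for the real visual-LLM call; returns one
    # "fine > coarse > ..." line per object so the parser below runs.
    return "golden retriever > dog > animal\nfrisbee > toy > object"

def refine_caption(image, raw_caption):
    # Prompt the VLLM to rewrite a noisy web caption into a detailed
    # description and extract multi-granular object labels from it.
    prompt = ("Rewrite this caption as a detailed description, then list "
              "every object with its category hierarchy: " + raw_caption)
    response = query_vllm(image, prompt)
    objects = []
    for line in response.splitlines():
        levels = [part.strip() for part in line.split(">")]
        objects.append(AnnotatedObject(phrase=levels[0], hierarchy=levels[1:]))
    return objects

# Each refined (image, objects) pair then provides rich, multi-granular
# supervision for pre-training the detector and its captioner.
print(refine_caption(None, "a dog playing outside"))
```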
Generating hierarchical labels even when no predefined categories are supplied brings two advantages: the detector stays applicable when appropriate input categories are unavailable, and each object receives a comprehensive, coarse-to-fine description (e.g., "terrier, dog, animal"). The hierarchical supervision itself comes from the auto-annotation pipeline above, while the two-stage schedule, low-resolution pre-training followed by high-resolution fine-tuning, keeps this large-scale training affordable.
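A minimal sketch of this two-stage schedule, assuming PyTorch. The model, resolutions, step counts, and learning rates are illustrative stand-ins, not the paper's recipe, and run_stage is a hypothetical helper.

```python
import torch
import torch.nn as nn

def run_stage(model, images, input_size, steps, lr):
    """Resize inputs to `input_size` and run `steps` optimizer updates."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(steps):
        batch = nn.functional.interpolate(images, size=(input_size, input_size))
        loss = model(batch).mean()   # stand-in for detection + captioning losses
        opt.zero_grad()
        loss.backward()
        opt.step()

model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.AdaptiveAvgPool2d(1))
images = torch.randn(4, 3, 1024, 1024)   # dummy image batch

# Stage 1: low-resolution pre-training, letting the captioner cheaply absorb
# a broad spectrum of visual concepts from large-scale image-text data.
run_stage(model, images, input_size=320, steps=10, lr=1e-4)

# Stage 2: high-resolution fine-tuning on fewer samples to sharpen detection.
run_stage(model, images, input_size=1024, steps=2, lr=1e-5)
```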
Extensive experiments validate these designs: DetCLIPv3 shows strong generative capability, clear improvements over previous methods on both open-vocabulary detection and dense captioning, and superior domain generalization and downstream transferability.