**Date:** 14 Apr 2024
**DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection**
**Authors:** Lewei Yao, Renjie Pi, Jianhua Han, Xiaodan Liang, Hang Xu, Wei Zhang, Zhenguo Li, Dan Xu
**Institution:** Hong Kong University of Science and Technology, Huawei Noah's Ark Lab, Shenzhen Campus of Sun Yat-Sen University
**Abstract:**
This paper introduces DetCLIPv3, a high-performing detector that excels at both open-vocabulary object detection and generating hierarchical labels for detected objects. DetCLIPv3 is characterized by three core designs: 1) a versatile model architecture that integrates an object captioner to provide generative capability; 2) high-information-density data, produced by an auto-annotation pipeline that uses visual large language models to refine captions for large-scale image-text pairs; and 3) an efficient multi-stage training strategy that uses low-resolution inputs for initial learning, followed by fine-tuning on high-resolution samples. DetCLIPv3 demonstrates superior zero-shot fixed AP on the LVIS minival benchmark, outperforming previous methods by significant margins. It also achieves state-of-the-art results in dense captioning, showcasing its strong generative capability.
**Key Contributions:**
1. **Versatile Model Architecture:** DetCLIPv3 is designed to handle both open-vocabulary object detection and hierarchical label generation.
2. **High Information Density Data:** An auto-annotation pipeline using visual large language models refines captions, providing rich, multi-granular object labels.
3. **Efficient Training Strategy:** A multi-stage training approach that initializes with low-resolution inputs and fine-tunes with high-resolution samples, optimizing training efficiency.
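The multi-stage schedule above can be sketched as a simple stage list. This is an illustrative assumption, not the paper's exact configuration: the stage names, resolutions, and data descriptions below are hypothetical placeholders standing in for the low-resolution pre-training and high-resolution fine-tuning phases.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    resolution: int  # input image size in pixels (illustrative values)
    data: str        # description of the training data for this stage

def build_schedule():
    """Low-resolution pre-training first, then high-resolution fine-tuning."""
    return [
        Stage("pretrain-lowres", resolution=320,
              data="large-scale auto-annotated image-text pairs"),
        Stage("finetune-highres", resolution=1280,
              data="high-quality detection samples"),
    ]

for stage in build_schedule():
    print(f"{stage.name}: {stage.resolution}px on {stage.data}")
```

The point of the design is that most of the optimization happens at low resolution, where each step is cheap, and only the final stage pays the cost of high-resolution inputs.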
**Experimental Results:**
- **Zero-Shot Fixed AP:** DetCLIPv3 achieves 47.0 fixed AP on the LVIS minival benchmark in a zero-shot setting, significantly outperforming GLIPv2, Grounding DINO, and DetCLIPv2.
- **Dense Captioning:** Achieves 19.7 mAP on the Visual Genome (VG) dense captioning benchmark, surpassing the previous best method, GRiT, by 2.9 mAP.
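The zero-shot detection setting above relies on a core mechanism of open-vocabulary detectors: region features are classified by similarity to text embeddings of arbitrary category names supplied at test time. A minimal sketch, using random vectors as stand-ins for the embeddings a real text and image encoder would produce:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in text embeddings for a test-time vocabulary (any names could be used).
categories = ["cat", "dog", "zebra"]
text_emb = rng.normal(size=(len(categories), 8))
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)

# A region feature constructed to resemble the "zebra" embedding plus noise.
region_feat = text_emb[2] + 0.1 * rng.normal(size=8)
region_feat /= np.linalg.norm(region_feat)

# Classification = cosine similarity against every category-name embedding.
scores = text_emb @ region_feat
predicted = categories[int(scores.argmax())]
print(predicted)
```

Because the vocabulary is just a list of strings embedded at inference time, the detector can be queried for categories never seen during training, which is what the zero-shot LVIS evaluation measures.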
**Conclusion:**
DetCLIPv3 broadens the scope of open-vocabulary object detection by enabling comprehensive, hierarchical object labeling, expanding its range of application scenarios.