14 Apr 2024 | Lewei Yao, Renjie Pi, Jianhua Han, Xiaodan Liang, Hang Xu, Wei Zhang, Zhenguo Li, Dan Xu
DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection
DetCLIPv3 is a high-performing open-vocabulary object detector that not only localizes objects but also generates hierarchical labels for them. It rests on three core designs: a versatile model architecture, high-information-density training data, and an efficient training strategy. Architecturally, a robust open-vocabulary detector is augmented with an object captioner that supplies the generative capability: the captioner consumes the detector's foreground proposals and is trained with a language-modeling objective to produce hierarchical labels for each detected object. This pairing yields both accurate localization and detailed descriptions of visual concepts.
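To make the design concrete, here is a minimal sketch (not the authors' code) of such a generative head: a small transformer decoder conditioned on pooled per-proposal features and trained with a next-token language-modeling loss. All module names and sizes (RegionCaptioner, region_dim, vocab_size, etc.) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RegionCaptioner(nn.Module):
    def __init__(self, region_dim=256, d_model=256, vocab_size=1000, max_len=16):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.region_proj = nn.Linear(region_dim, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, region_feats, tokens):
        # region_feats: (num_proposals, region_dim) pooled features of the
        # detector's foreground proposals; tokens: (num_proposals, seq_len)
        # tokenized hierarchical labels, e.g. "terrier, dog, animal".
        seq_len = tokens.size(1)
        pos = torch.arange(seq_len, device=tokens.device)
        x = self.token_emb(tokens) + self.pos_emb(pos)
        memory = self.region_proj(region_feats).unsqueeze(1)  # 1 memory token per region
        causal = nn.Transformer.generate_square_subsequent_mask(seq_len)
        h = self.decoder(x, memory, tgt_mask=causal)
        return self.lm_head(h)  # (num_proposals, seq_len, vocab_size)

# Language-modeling objective: predict each next token of the label sequence.
model = RegionCaptioner()
feats = torch.randn(8, 256)                # 8 foreground proposals
tokens = torch.randint(0, 1000, (8, 16))   # their tokenized hierarchical labels
logits = model(feats, tokens[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))
loss.backward()
```

At inference, the same decoder would be run autoregressively from a start token, emitting a hierarchical label sequence for each detected object without any predefined category list.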
The model relies on an auto-annotation pipeline in which visual large language models refine the captions of large-scale image-text pairs, yielding rich, multi-granular object labels for training. Training then proceeds in two stages: pre-training with low-resolution inputs lets the object captioner efficiently absorb a broad spectrum of visual concepts, and fine-tuning with high-resolution samples further strengthens detection performance. DetCLIPv3 delivers superior open-vocabulary detection, e.g., 47.0 zero-shot fixed AP on the LVIS minival benchmark, outperforming previous methods by significant margins, and a state-of-the-art 19.7 AP for dense captioning on the VG dataset.
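The paper does not publish the pipeline as code; the following is only a hedged sketch of the idea of VLLM-driven caption refinement. The query_vllm helper, its prompt, and the line-oriented output format are hypothetical placeholders for whichever visual LLM is actually used.

```python
from dataclasses import dataclass

@dataclass
class AnnotatedObject:
    phrase: str        # finest-grained label, e.g. "golden retriever"
    hierarchy: list    # coarser parents, e.g. ["dog", "animal"]

def query_vllm(image, prompt):
    # Placeholder for the real visual-LLM call; returns one
    # "fine > coarse > ..." line per object so the parser below runs.
    return "golden retriever > dog > animal\nfrisbee > toy > object"

def refine_caption(image, raw_caption):
    # Prompt the VLLM to rewrite a noisy web caption into a detailed
    # description and extract multi-granular object labels from it.
    prompt = ("Rewrite this caption as a detailed description, then list "
              "every object with its category hierarchy: " + raw_caption)
    response = query_vllm(image, prompt)
    objects = []
    for line in response.splitlines():
        levels = [part.strip() for part in line.split(">")]
        objects.append(AnnotatedObject(phrase=levels[0], hierarchy=levels[1:]))
    return objects

# Each refined (image, objects) pair then provides rich, multi-granular
# supervision for pre-training the detector and its captioner.
print(refine_caption(None, "a dog playing outside"))
```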
Generating hierarchical labels even when no predefined categories are supplied brings two advantages: the detector stays applicable when appropriate input categories are unavailable, and each object receives a comprehensive, coarse-to-fine description (e.g., "terrier, dog, animal"). The hierarchical supervision itself comes from the auto-annotation pipeline above, while the two-stage schedule, low-resolution pre-training followed by high-resolution fine-tuning, keeps this large-scale training affordable.
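A minimal sketch of this two-stage schedule, assuming PyTorch. The model, resolutions, step counts, and learning rates are illustrative stand-ins, not the paper's recipe, and run_stage is a hypothetical helper.

```python
import torch
import torch.nn as nn

def run_stage(model, images, input_size, steps, lr):
    """Resize inputs to `input_size` and run `steps` optimizer updates."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(steps):
        batch = nn.functional.interpolate(images, size=(input_size, input_size))
        loss = model(batch).mean()   # stand-in for detection + captioning losses
        opt.zero_grad()
        loss.backward()
        opt.step()

model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.AdaptiveAvgPool2d(1))
images = torch.randn(4, 3, 1024, 1024)   # dummy image batch

# Stage 1: low-resolution pre-training, letting the captioner cheaply absorb
# a broad spectrum of visual concepts from large-scale image-text data.
run_stage(model, images, input_size=320, steps=10, lr=1e-4)

# Stage 2: high-resolution fine-tuning on fewer samples to sharpen detection.
run_stage(model, images, input_size=1024, steps=2, lr=1e-5)
```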
Extensive experiments validate these designs: DetCLIPv3 shows strong generative capability, clear improvements over previous methods on both open-vocabulary detection and dense captioning, and superior domain generalization and downstream transferability.