LLMs MEET VLMs: BOOST OPEN VOCABULARY OBJECT DETECTION WITH FINE-GRAINED DESCRIPTORS


2024 | Sheng Jin, Xueying Jiang, Jiaxing Huang, Lewei Lu, Shijian Lu
This paper introduces DVDet, a Descriptor-Enhanced open-Vocabulary Detector that improves open vocabulary object detection by leveraging fine-grained textual descriptors. DVDet combines two designs: a conditional context prompt that transforms region embeddings into image-like representations, allowing them to be integrated seamlessly into standard open vocabulary detection training, and a hierarchical scheme that treats large language models as an interactive knowledge repository for iteratively mining and refining visually oriented descriptors to achieve precise region-text alignment.

The key component is the conditional context prompt (CCP), which fuses contextual background information surrounding each region proposal with the region embedding itself to produce an image-like representation. In parallel, a hierarchical descriptor generation mechanism interacts iteratively with large language models to mine and refine fine-grained descriptors, improving their diversity and visual relevance.

Extensive experiments on the COCO and LVIS benchmarks show that DVDet consistently outperforms state-of-the-art open vocabulary detectors, improving accuracy on both base and novel categories; ablation studies and transfer experiments to other datasets further validate its effectiveness. The gains are most pronounced in challenging scenarios with occluded or distant objects, where the refined descriptors reduce ambiguity and misclassification. The ability to generate and refine fine-grained descriptors through interaction with large language models is central to DVDet's effectiveness in open vocabulary object detection.
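To make the conditional context prompt idea concrete, below is a minimal PyTorch sketch of fusing a region crop with an enlarged context crop into an image-like token grid. The class name, the 7x7 output size, the 1.5x context-expansion ratio, and the 1x1-convolution fusion are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torchvision.ops as ops


class ConditionalContextPrompt(nn.Module):
    """Sketch: fuse a region proposal's features with its surrounding context
    into an image-like grid that a CLIP-style encoder can consume."""

    def __init__(self, feat_dim: int, out_size: int = 7, context_scale: float = 1.5):
        super().__init__()
        self.out_size = out_size
        self.context_scale = context_scale
        # Lightweight fusion of region and context features (illustrative choice).
        self.fuse = nn.Conv2d(2 * feat_dim, feat_dim, kernel_size=1)

    def _expand(self, boxes: torch.Tensor) -> torch.Tensor:
        # Enlarge each (x1, y1, x2, y2) box to include background context.
        cx = (boxes[:, 0] + boxes[:, 2]) / 2
        cy = (boxes[:, 1] + boxes[:, 3]) / 2
        w = (boxes[:, 2] - boxes[:, 0]) * self.context_scale
        h = (boxes[:, 3] - boxes[:, 1]) * self.context_scale
        return torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=1)

    def forward(self, feat_map: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # feat_map: (1, C, H, W) backbone features; boxes: (N, 4) in feature coordinates.
        region = ops.roi_align(feat_map, [boxes], output_size=self.out_size)
        context = ops.roi_align(feat_map, [self._expand(boxes)], output_size=self.out_size)
        # Concatenate region + context and fuse into an image-like representation.
        return self.fuse(torch.cat([region, context], dim=1))
```

The resulting (N, C, 7, 7) grid can then be flattened into patch-like tokens and aligned against descriptor text embeddings during detection training.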
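The iterative descriptor mining loop can be sketched as follows, assuming a frozen CLIP-style text encoder and an LLM reachable through a user-provided `query_llm(prompt) -> list[str]` helper. Both helpers, the prompt wording, and the keep/refine threshold are assumptions for illustration rather than the paper's exact protocol.

```python
from typing import Callable, List
import torch
import torch.nn.functional as F


def mine_descriptors(
    category: str,
    region_embeds: torch.Tensor,                        # (N, D) embeddings of regions of this category
    encode_text: Callable[[List[str]], torch.Tensor],   # frozen text encoder -> (M, D)
    query_llm: Callable[[str], List[str]],              # hypothetical LLM interface
    rounds: int = 3,
    keep_threshold: float = 0.25,
) -> List[str]:
    """Iteratively ask the LLM for fine-grained visual descriptors and keep
    only those that align well with the visual region embeddings."""
    descriptors = query_llm(
        f"List short visual descriptors (parts, texture, shape) of a {category}."
    )
    kept: List[str] = []
    for _ in range(rounds):
        texts = [f"{category}, which has {d}" for d in descriptors]
        text_embeds = F.normalize(encode_text(texts), dim=-1)   # (M, D)
        regions = F.normalize(region_embeds, dim=-1)            # (N, D)
        scores = (regions @ text_embeds.T).mean(dim=0)          # (M,) average visual alignment
        kept += [d for d, s in zip(descriptors, scores.tolist()) if s >= keep_threshold]
        weak = [d for d, s in zip(descriptors, scores.tolist()) if s < keep_threshold]
        if not weak:
            break
        # Ask the LLM to replace weakly aligned descriptors with finer-grained ones.
        descriptors = query_llm(
            f"Refine these descriptors of a {category} to be more visually specific: {weak}"
        )
    return list(dict.fromkeys(kept))  # de-duplicate while preserving order
```

At inference, a region's score for each category can be obtained by averaging its similarities to that category's kept descriptors, in the spirit of classification-by-description approaches.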