The paper "LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-Grained Descriptors" introduces DVDet, an approach that improves open vocabulary object detection (OVOD) by leveraging fine-grained textual descriptions of object parts and attributes. Motivated by the zero-shot capabilities of vision language models (VLMs), the authors propose a Descriptor-Enhanced Open Vocabulary Detector that uses conditional context prompts and hierarchical textual descriptors to improve region-text alignment. The key contributions include:
1. **Conditional Context Regional Prompts (CCP)**: This technique transforms region embeddings into image-like representations by integrating contextual background information, enabling seamless integration into existing open vocabulary detectors.
2. **Hierarchical Descriptor Generation**: A mechanism that iteratively interacts with large language models (LLMs) to generate and refine fine-grained descriptors, enhancing the precision of region-text alignment.
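To make the idea of descriptor-based region-text alignment concrete, here is a minimal sketch in Python. It is not the paper's implementation: the function names (`cosine`, `descriptor_score`, `classify`), the blending `weight`, the averaging of descriptor similarities, and the toy 3-dimensional embeddings are all assumptions made for illustration. In a real system the vectors would come from a VLM's image and text encoders, and the descriptors from an LLM.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def descriptor_score(region, class_emb, descriptor_embs, weight=0.5):
    """Blend class-name similarity with the mean descriptor similarity.

    The linear blend and the `weight` parameter are illustrative
    assumptions, not the aggregation rule used in the paper.
    """
    base = cosine(region, class_emb)
    if not descriptor_embs:
        return base  # no descriptors available: fall back to the class name
    fine = sum(cosine(region, d) for d in descriptor_embs) / len(descriptor_embs)
    return (1 - weight) * base + weight * fine

def classify(region, classes, weight=0.5):
    """Return the best-scoring class name and the score of every class."""
    scores = {
        name: descriptor_score(region, c["name"], c["descriptors"], weight)
        for name, c in classes.items()
    }
    return max(scores, key=scores.get), scores

# Toy embeddings (hypothetical): both class names look identical to the
# region, so only the descriptors can tell the two classes apart.
region = [1.0, 0.0, 1.0]
classes = {
    "cat": {"name": [1.0, 0.0, 0.0],
            "descriptors": [[0.0, 0.0, 1.0], [1.0, 0.0, 1.0]]},
    "dog": {"name": [1.0, 0.0, 0.0],
            "descriptors": [[0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]},
}

label, scores = classify(region, classes)
print(label, scores)
```

In this toy setup the class-name embeddings alone give both classes the same score, and the descriptor terms break the tie. That is the intuition behind adding part- and attribute-level text to the alignment step.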
Experiments on multiple benchmarks, including COCO and LVIS, demonstrate that DVDet consistently outperforms state-of-the-art methods, particularly in handling novel classes and challenging scenarios with distant or occluded objects. The paper also includes ablation studies and a transfer experiment to other datasets, validating the effectiveness and generalization of the proposed method.