LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained Descriptors

The paper "LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-Grained Descriptors" introduces DVDet, an approach that improves open vocabulary object detection (OVOD) by using fine-grained textual descriptions of object parts and attributes. Motivated by the zero-shot capabilities of vision language models (VLMs), the authors propose a Descriptor-Enhanced Open Vocabulary Detector that uses conditional context prompts and hierarchical textual descriptors to improve region-text alignment. The key contributions are:

1. **Conditional Context Regional Prompts (CCP)**: This technique transforms region embeddings into image-like representations by integrating contextual background information, so that it can be plugged into existing open vocabulary detectors.
2. **Hierarchical Descriptor Generation**: A mechanism that iteratively interacts with large language models (LLMs) to generate and refine fine-grained descriptors, improving the precision of region-text alignment.

Experiments on multiple benchmarks, including COCO and LVIS, show that DVDet consistently outperforms state-of-the-art methods, particularly on novel classes and in challenging scenarios with distant or occluded objects. The paper also includes ablation studies and a transfer experiment to other datasets, which support the effectiveness and generalization of the proposed method.
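To make the idea of descriptor-based region-text alignment more concrete, here is a minimal sketch of how a region embedding might be scored against a category name together with its fine-grained descriptors. It is an illustration under stated assumptions, not DVDet's actual formulation: the function names (`cosine`, `class_score`, `classify_region`), the averaging over descriptors, and the `weight` parameter are all hypothetical, and short toy vectors stand in for real VLM embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def class_score(region_emb, name_emb, descriptor_embs, weight=0.5):
    """Blend category-name similarity with mean descriptor similarity.

    ASSUMPTION: a simple weighted average; the paper's scoring rule may differ.
    """
    name_sim = cosine(region_emb, name_emb)
    if not descriptor_embs:
        # No descriptors available: fall back to the category name alone.
        return name_sim
    desc_sim = sum(cosine(region_emb, d) for d in descriptor_embs) / len(descriptor_embs)
    return (1 - weight) * name_sim + weight * desc_sim


def classify_region(region_emb, classes, weight=0.5):
    """Return the best-scoring class name and the per-class scores.

    `classes` maps a class name to {"name": embedding, "descriptors": [embeddings]}.
    """
    scores = {
        label: class_score(region_emb, entry["name"], entry["descriptors"], weight)
        for label, entry in classes.items()
    }
    return max(scores, key=scores.get), scores


# Toy usage: 2-D vectors stand in for text and region embeddings (hypothetical values).
toy_classes = {
    "cat": {"name": [1.0, 0.0], "descriptors": [[1.0, 0.0], [0.0, 1.0]]},
    "dog": {"name": [0.0, 1.0], "descriptors": [[0.0, 1.0]]},
}
best_label, all_scores = classify_region([1.0, 0.0], toy_classes)
```

In this toy setup a region whose embedding matches the "cat" name and one of its two descriptors scores higher for "cat" than for "dog". In a real detector the embeddings would come from the VLM's text and image encoders, and the descriptors would be the part and attribute phrases produced by the LLM.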

LLMs MEET VLMs: BOOST OPEN VOCABULARY OBJECT DETECTION WITH FINE-GRAINED DESCRIPTORS

7 Feb 2024 | Sheng Jin¹, Xueying Jiang¹, Jiaxing Huang¹, Lewei Lu², Shijian Lu¹*