An Open and Comprehensive Pipeline for Unified Object Grounding and Detection


5 Jan 2024 | Xiangyu Zhao, Yicheng Chen, Shilin Xu, Xiangtai Li, Xinjiang Wang, Yining Li, Haian Huang
This paper introduces MM-Grounding-DINO, an open-source, comprehensive pipeline for unified object grounding and detection. Built on the Grounding-DINO architecture using the MMDetection toolbox, it is pretrained on a wide range of vision datasets and addresses three key tasks: Open-Vocabulary Detection (OVD), Phrase Grounding (PG), and Referring Expression Comprehension (REC).

MM-Grounding-DINO is trained on a variety of datasets, including COCO, Objects365, GRIT, V3Det, RefCOCO, RefCOCO+, RefCOCOg, GQA, and Flickr30K Entities, and evaluated on multiple benchmarks, including COCO, LVIS, RefCOCO, RefCOCO+, RefCOCOg, gRefCOCO, and the Description Detection Dataset (D³). The results show that MM-Grounding-DINO outperforms the Grounding-DINO baseline on most tasks, particularly REC.

Training combines pretraining and fine-tuning: the pretraining phase draws on a large collection of vision datasets, while the fine-tuning phase targets the specific datasets of each downstream task. The model is evaluated on various downstream tasks, including object detection in hazy and underwater conditions, brain tumor detection, and Cityscapes, performing well across these settings with significant improvements after fine-tuning. On the People in Paintings dataset, it outperforms fine-tuned models in a zero-shot setting; on the Brain Tumor dataset, it performs slightly worse than Cascade-DINO, which the authors attribute to that dataset's reliance on purely numerical labels.
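The unifying idea behind the three tasks is that each can be cast as text-conditioned detection: the text prompt is either a concatenated list of category names (OVD), a caption whose noun phrases are grounded (PG), or a single referring expression (REC). A minimal sketch of how such prompts might be constructed, assuming the Grounding-DINO convention of joining category names with " . " (the function names and prompt format here are illustrative assumptions, not the paper's actual API):

```python
from typing import List

SEP = " . "  # Grounding-DINO-style prompts commonly join category names with " . "

def ovd_prompt(categories: List[str]) -> str:
    """Open-Vocabulary Detection: the prompt lists every category to detect."""
    return SEP.join(categories) + " ."

def pg_prompt(caption: str) -> str:
    """Phrase Grounding: the whole caption is the prompt; its noun phrases
    are then located in the image."""
    return caption

def rec_prompt(expression: str) -> str:
    """Referring Expression Comprehension: the prompt describes exactly one
    object instance to localize."""
    return expression

# All three tasks then reduce to the same interface:
#   detect(image, text_prompt) -> boxes aligned to spans of the prompt
print(ovd_prompt(["person", "bicycle", "traffic light"]))
```

This shared interface is what lets a single pretrained model serve all three tasks, with only the prompt (and the evaluation protocol) changing between them.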
Overall, MM-Grounding-DINO is a comprehensive, open-source pipeline for unified object grounding and detection that demonstrates strong performance across a wide range of tasks and datasets; its code and trained models are publicly released for research and development.