Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

19 Jul 2024 | Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang
Grounding DINO is an open-set object detector that marries the transformer-based DINO detector with grounded pre-training, enabling it to detect arbitrary objects from human inputs such as category names or referring expressions. The key to open-set detection is introducing language into a closed-set detector for concept generalization. To fuse the vision and language modalities effectively, the closed-set detector is conceptually divided into three phases, and a tight-fusion design is proposed: a feature enhancer, a language-guided query selection module, and a cross-modality decoder. Grounding DINO also introduces a sub-sentence level text representation, which uses detection data for text prompts in a more reasonable way by preventing unrelated category names from interacting in the text encoder. Sketches of both components follow below.
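As a concrete illustration of language-guided query selection, here is a minimal PyTorch sketch based on the paper's description: each image token is scored by its best-matching text token, and the top-scoring tokens initialize the decoder queries. The default of 900 queries follows the paper; the function name and shapes are illustrative, and the code is a sketch rather than the repository's implementation.

```python
import torch


def language_guided_query_selection(image_features, text_features, num_query=900):
    """Pick the image tokens most relevant to the text prompt as decoder queries.

    image_features: (num_img_tokens, d_model) enhanced image tokens
    text_features:  (num_text_tokens, d_model) enhanced text tokens
    Returns indices of the top-`num_query` image tokens.
    """
    # Dot-product similarity between every image token and every text token.
    logits = image_features @ text_features.T      # (num_img, num_text)
    # Score each image token by its best-matching text token.
    token_scores = logits.max(dim=-1).values       # (num_img,)
    # Keep the highest-scoring tokens; their positions initialize the queries.
    k = min(num_query, token_scores.numel())
    return torch.topk(token_scores, k=k).indices
```

The sub-sentence representation can be sketched similarly. When category names are concatenated into one prompt (e.g. "cat . dog . person"), an attention mask lets tokens attend within their own phrase but not across phrases, keeping per-word features while avoiding unwanted cross-category interactions. The `phrase_ids` input below is a hypothetical pre-computed token-to-phrase assignment:

```python
import torch


def subsentence_attention_mask(phrase_ids):
    """phrase_ids: one phrase index per token, e.g. [0, 0, 1, 2, 2].

    Returns a boolean mask where mask[i, j] is True iff tokens i and j
    belong to the same phrase (category name) and may attend to each other.
    """
    ids = torch.tensor(phrase_ids)
    return ids.unsqueeze(0) == ids.unsqueeze(1)
```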
Grounding DINO is pre-trained on large-scale data spanning object detection data, grounding data, and caption data (O365 and GoldG in the base setting), and is evaluated on both open-set object detection and referring expression comprehension (REC) benchmarks. It performs well on all three settings, covering COCO, LVIS, ODinW, and RefCOCO/+/g: it achieves 52.5 AP on the COCO zero-shot detection benchmark and sets a new record on the ODinW zero-shot benchmark with a mean AP of 26.1. Compared with other open-set methods (Table 1 of the paper), Grounding DINO outperforms GLIP and DetCLIPv2 in zero-shot transfer, surpasses GLIP under the same backbone, and beats GLIP on the REC task; it is also more compact and more consistent in performance than GLIPv2, and it outperforms DINO on the ODinW benchmark. Overall, the results confirm the effectiveness of the tight-fusion design and its extension to REC tasks. Checkpoints and inference code are released at https://github.com/IDEA-Research/GroundingDINO; a minimal usage sketch follows.
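For reference, here is a minimal inference sketch in the style of the repository's README. The import path and the load_model/load_image/predict/annotate helpers come from the released code; the config, checkpoint, and image paths and the threshold values are illustrative, not fixed requirements.

```python
import cv2
from groundingdino.util.inference import load_model, load_image, predict, annotate

# Config file and Swin-T checkpoint as shipped with the repository
# (paths are illustrative; adjust to where you placed the files).
model = load_model(
    "groundingdino/config/GroundingDINO_SwinT_OGC.py",
    "weights/groundingdino_swint_ogc.pth",
)

# Category names are separated by " . ", matching the paper's prompt format.
image_source, image = load_image("assets/demo.jpg")
boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption="chair . person . dog .",
    box_threshold=0.35,   # keep boxes whose best phrase score exceeds this
    text_threshold=0.25,  # keep phrase tokens scoring above this
)

annotated = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)
cv2.imwrite("annotated.jpg", annotated)
```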