Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

19 Jul 2024 | Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang
The paper introduces Grounding DINO, an open-set object detector that combines a Transformer-based detector with grounded pre-training. Grounding DINO can detect arbitrary objects specified by human language inputs, such as category names or referring expressions. The key idea is to introduce language into a closed-set detector so it generalizes to open-set concepts. The authors propose a tight-fusion design with three components: a feature enhancer, a language-guided query selection module, and a cross-modality decoder, which together fuse the visual and textual modalities. Grounding DINO is pre-trained on large-scale object detection, grounding, and caption data, and is evaluated on benchmarks including COCO, LVIS, ODinW, and RefCOCO/+/g. The model achieves state-of-the-art results on zero-shot detection benchmarks, outperforming prior methods by a significant margin. The paper also explores image editing by combining Grounding DINO with Stable Diffusion.
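To make the language-guided query selection step more concrete, below is a minimal, illustrative PyTorch sketch of the idea described in the paper: each image token is scored by its best-matching text token, and the top-scoring image tokens are used to initialize the decoder queries. This is not the authors' implementation; the function name, tensor shapes, and the default of 900 queries are assumptions made for illustration.

```python
# Illustrative sketch (not the official code) of language-guided query selection:
# score image tokens by their maximum similarity to any text token, then keep
# the top-k tokens to seed the cross-modality decoder queries.
import torch

def language_guided_query_selection(image_features: torch.Tensor,
                                    text_features: torch.Tensor,
                                    num_queries: int = 900) -> torch.Tensor:
    """Select indices of image tokens used to initialize decoder queries.

    image_features: (batch, num_img_tokens, d_model) enhanced image features
    text_features:  (batch, num_text_tokens, d_model) enhanced text features
    Returns: (batch, num_queries) indices of the selected image tokens.
    """
    # Similarity of every image token to every text token: (B, N_img, N_text)
    logits = torch.einsum("bid,btd->bit", image_features, text_features)
    # Score each image token by its best-matching text token: (B, N_img)
    token_scores = logits.max(dim=-1).values
    # Keep the top-k image tokens; their positions seed the decoder queries.
    return token_scores.topk(num_queries, dim=-1).indices

# Example with dummy features: 1 image, 1200 image tokens, 16 text tokens, 256-d.
img = torch.randn(1, 1200, 256)
txt = torch.randn(1, 16, 256)
print(language_guided_query_selection(img, txt).shape)  # torch.Size([1, 900])
```

In the full model, the selected positions initialize the positional part of the queries, which are then refined by the cross-modality decoder through image and text cross-attention layers.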