1 Jun 2024 | Tianhe Ren*, Qing Jiang*, Shilong Liu*, Zhaoyang Zeng*, Wenlong Liu, Han Gao, Hongjie Huang, Zhengyu Ma, Xiaoke Jiang, Yihao Chen, Yuda Xiong, Hao Zhang, Feng Li, Peijun Tang, Kent Yu, Lei Zhang†
This paper introduces Grounding DINO 1.5, a suite of advanced open-set object detection models developed by IDEA Research. The suite includes two models: *Grounding DINO 1.5 Pro* and *Grounding DINO 1.5 Edge*. *Grounding DINO 1.5 Pro* is designed for stronger generalization across a wide range of scenarios, while *Grounding DINO 1.5 Edge* is optimized for inference speed, as required for edge deployment. *Grounding DINO 1.5 Pro* scales up the model architecture, integrates an enhanced vision backbone, and expands the training dataset to over 20 million images with grounding annotations, which gives it a richer semantic understanding. *Grounding DINO 1.5 Edge* retains robust detection capabilities while using reduced feature scales, and is trained on the same dataset. Empirically, *Grounding DINO 1.5 Pro* achieves 54.3 AP on the COCO detection benchmark and 55.7 AP on the LVIS-minival zero-shot transfer benchmark, setting new records. *Grounding DINO 1.5 Edge*, when optimized with TensorRT, runs at 75.2 FPS and reaches a zero-shot performance of 36.2 AP on the LVIS-minival benchmark, making it suitable for edge computing scenarios. The paper also covers model training, evaluation, and case analyses that demonstrate the models' effectiveness in real-world applications.
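
To put the reported Edge throughput in perspective, the short sketch below converts frames per second into an average per-frame time. It is only illustrative arithmetic, not code from the paper: the helper names `fps_to_latency_ms` and `meets_realtime_budget` and the 30 FPS target are hypothetical, and treating throughput as per-frame latency assumes frames are processed one at a time (batch size 1, no pipelining).

```python
def fps_to_latency_ms(fps: float) -> float:
    """Average time per frame in milliseconds for a given throughput.

    Assumes sequential, batch-size-1 processing, so throughput and
    per-frame latency are simple reciprocals of each other.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def meets_realtime_budget(fps: float, target_fps: float = 30.0) -> bool:
    """True if the per-frame time fits within the budget of a target frame rate.

    The 30 FPS default is a hypothetical application requirement,
    not a figure taken from the paper.
    """
    return fps_to_latency_ms(fps) <= fps_to_latency_ms(target_fps)


# Throughput reported in the abstract for Grounding DINO 1.5 Edge with TensorRT.
edge_fps = 75.2

print(f"{fps_to_latency_ms(edge_fps):.1f} ms per frame")
print("fits a 30 FPS budget:", meets_realtime_budget(edge_fps))
```

Under these assumptions, 75.2 FPS corresponds to roughly 13.3 ms per frame, which leaves headroom inside a 33.3 ms budget for pre- and post-processing.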