Grounding DINO 1.5: Advance the “Edge” of Open-Set Object Detection

1 Jun 2024 | Tianhe Ren*, Qing Jiang*, Shilong Liu*, Zhaoyang Zeng*, Wenlong Liu, Han Gao, Hongjie Huang, Zhengyu Ma, Xiaoke Jiang, Yihao Chen, Yuda Xiong, Hao Zhang, Feng Li, Peijun Tang, Kent Yu, Lei Zhang†
Grounding DINO 1.5 is a series of open-set object detection models developed by IDEA Research, aiming to advance the "Edge" of open-set object detection. The series includes two models: Grounding DINO 1.5 Pro, a high-performance model designed for strong generalization across a wide range of scenarios, and Grounding DINO 1.5 Edge, an efficient model optimized for the faster inference that edge deployment requires. The Pro model improves on its predecessor by scaling up the model architecture, integrating an enhanced vision backbone, and expanding the training dataset to over 20 million images with grounding annotations, which yields richer semantic understanding. The Edge model uses reduced feature scales for efficiency and retains robust detection capability because it is trained on the same dataset. Empirically, the Pro model achieves 54.3 AP on the COCO detection benchmark and 55.7 AP on the LVIS-minival zero-shot transfer benchmark. It also performs well in real-world scenarios, achieving 58.7 AP on the ODinW13 benchmark and setting a new record on the ODinW35 benchmark. The Edge model, when optimized with TensorRT, runs at 75.2 FPS while reaching 36.2 AP zero-shot on LVIS-minival; its zero-shot result on COCO is 45.0 AP. Both models perform robustly across detection tasks, including common object detection, long-tailed object detection, short and long caption grounding, dense object detection, and video object detection. Compared with the original Grounding DINO model, Grounding DINO 1.5 Pro is stronger in dense scene detection, long-tailed object detection, and semantic understanding accuracy.
The Edge model is optimized for edge devices and achieves real-time performance, with an inference speed of over 10 FPS on the NVIDIA Orin NX platform. Across the benchmarks evaluated, both models improve significantly on previous methods, with the Pro model setting new records on several of them. The models are also demonstrated in a range of real-world applications, showing their versatility in different scenarios.
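
To put the reported throughput figures in more familiar terms, the minimal sketch below converts frames per second into an average per-frame latency. It assumes frames are processed one at a time with no batching or pipelining, which the summary does not specify; the function name `fps_to_latency_ms` and the dictionary labels are illustrative and are not taken from the paper or its code.

```python
def fps_to_latency_ms(fps: float) -> float:
    """Average per-frame latency in milliseconds for a given throughput.

    Assumes sequential, one-frame-at-a-time processing (an assumption,
    not something stated in the summary).
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


# Throughput figures quoted in the summary; the labels are illustrative.
reported_fps = {
    "Edge model with TensorRT": 75.2,
    "Edge model on NVIDIA Orin NX (reported as 'over 10 FPS')": 10.0,
}

for setup, fps in reported_fps.items():
    latency = fps_to_latency_ms(fps)
    print(f"{setup}: {fps} FPS is about {latency:.1f} ms per frame")
```

Under this assumption, 75.2 FPS corresponds to roughly 13.3 ms per frame, and the "over 10 FPS" figure on Orin NX corresponds to at most about 100 ms per frame.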