DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM

23 Jul 2024 | Yixuan Wu, Yizhou Wang, Shixiang Tang, Wenhao Wu, Tong He, Wanli Ouyang, Philip Torr, Jian Wu
DetToolChain is a prompting paradigm designed to improve the zero-shot object detection capabilities of multimodal large language models (MLLMs) such as GPT-4V and Gemini. It combines a detection prompting toolkit, inspired by high-precision detection priors, with a new detection Chain-of-Thought (CoT) that applies these prompts. The toolkit contains visual processing prompts and detection reasoning prompts that guide the MLLM to focus on regional information, read coordinates according to measurement standards, and infer from contextual information. The multimodal detection CoT automatically decomposes a task into subtasks, diagnoses predictions, and plans progressive box refinements.
The authors evaluate DetToolChain on open-vocabulary detection, described object detection, and referring expression comprehension, including particularly challenging cases. GPT-4V with DetToolChain improves over state-of-the-art object detectors by +21.5% AP50 on the MS COCO Novel class set, +24.23% Acc on the RefCOCO val set, and +14.5% AP on D-cube described object detection. DetToolChain allows MLLMs to support these detection tasks without instruction tuning while significantly improving on baseline models.
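The progressive box refinement described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `propose` callable, the `fake_mllm` stand-in, the normalized box format, and the stopping rule (stop once consecutive boxes overlap with IoU of at least `stop_iou`) are all assumptions made for illustration. The only standard piece is intersection over union (IoU), the overlap measure behind AP50, which counts a detection as correct when its IoU with the ground-truth box is at least 0.5.

```python
from typing import Callable, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max), normalized to [0, 1].
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def refine_box(
    initial: Box,
    propose: Callable[[Box], Box],
    max_steps: int = 5,
    stop_iou: float = 0.95,
) -> Box:
    """Ask `propose` for a corrected box until it stabilizes or steps run out.

    `propose` is a hypothetical stand-in for one round of prompting the
    MLLM with the current box and reading back a corrected one.
    """
    box = initial
    for _ in range(max_steps):
        new_box = propose(box)
        # Assumed diagnosis step: treat a near-identical box as converged.
        if iou(box, new_box) >= stop_iou:
            return new_box
        box = new_box
    return box


# Hypothetical stand-in for the model: moves halfway towards a fixed target.
TARGET: Box = (0.2, 0.2, 0.6, 0.6)


def fake_mllm(box: Box) -> Box:
    return tuple(b + 0.5 * (t - b) for b, t in zip(box, TARGET))


if __name__ == "__main__":
    start: Box = (0.0, 0.0, 1.0, 1.0)
    refined = refine_box(start, fake_mllm)
    print("IoU before:", iou(start, TARGET))
    print("IoU after: ", iou(refined, TARGET))
```

In a real pipeline, `propose` would wrap a call to the MLLM with the image and the current box drawn on it. The loop structure only shows how a diagnose-then-refine cycle can be bounded and terminated.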