DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM


23 Jul 2024 | Yixuan Wu, Yizhou Wang, Shixiang Tang, Wenhao Wu, Tong He, Wanli Ouyang, Philip Torr, and Jian Wu
**Institutions:** Zhejiang University, Shanghai AI Lab, The Chinese University of Hong Kong, The University of Sydney, University of Oxford

**Abstract:** This paper introduces DetToolChain, a novel prompting paradigm designed to unleash the zero-shot object detection ability of multimodal large language models (MLLMs) such as GPT-4V and Gemini. The approach consists of a detection prompting toolkit inspired by high-precision detection priors and a new Chain-of-Thought (CoT) mechanism. The toolkit comprises visual processing prompts and detection reasoning prompts, which guide the MLLM to focus on regional information, read coordinates accurately, and reason from contextual information. The CoT automatically decomposes a detection task into subtasks, diagnoses the predictions, and refines the bounding boxes. DetToolChain yields significant improvements over existing methods across a range of detection tasks: GPT-4V with DetToolChain achieves +21.5% AP50 on the MS COCO Novel class set for open-vocabulary detection, +24.23% Acc on the RefCOCO val set for zero-shot referring expression comprehension, and +14.5% AP on the D-cube described object detection FULL setting.

**Keywords:** Multimodal Large Language Model, Prompting, Detection

**Introduction:** Large language models (LLMs) have shown remarkable capabilities in understanding human language and solving practical problems, yet the detection ability of their multimodal counterparts remains largely untapped. Prior efforts to improve it include fine-tuning on high-quality question-answer pairs and hand-designing textual or visual prompts. DetToolChain instead combines visual processing prompts and detection reasoning prompts with a multimodal detection Chain-of-Thought (Det-CoT) that steers the MLLM through the whole detection process.

**Method:** The DetToolChain framework couples two kinds of prompts. Visual processing prompts pre-process the input image to make detection easier: regional amplifiers magnify regions of interest, spatial measurement standards overlay coordinate references the model can read from, and scene image parsers expose the layout of the scene. Detection reasoning prompts then evaluate the resulting predictions, diagnose likely errors, and refine the bounding boxes. The Det-CoT orchestrates the pipeline end to end, decomposing the task into subtasks, selecting the appropriate prompts, and iterating until the predictions are reliable; illustrative sketches of both components follow the Experiments paragraph below.

**Experiments:** DetToolChain is evaluated across open-vocabulary detection, described object detection, referring expression comprehension, and oriented object detection. In every setting it improves significantly over existing methods, demonstrating its potential to enhance MLLMs' detection capabilities.
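To make the visual processing prompts concrete, here is a minimal sketch of a spatial measurement standard: overlaying a labeled pixel grid on the input image so the MLLM can read box coordinates off the image itself. The function name and the specific grid style are illustrative assumptions, not the authors' implementation.

```python
# Sketch of a "spatial measurement standard" visual processing prompt:
# draw a labeled coordinate grid so the MLLM has a visible ruler to read
# box coordinates from. Illustrative only, not the paper's exact overlay.
from PIL import Image, ImageDraw

def overlay_measurement_grid(img: Image.Image, step: int = 100) -> Image.Image:
    """Return a copy of `img` with grid lines every `step` pixels,
    each line labeled with its pixel coordinate."""
    out = img.convert("RGB")  # convert() returns a new image
    draw = ImageDraw.Draw(out)
    w, h = out.size
    for x in range(0, w, step):  # vertical lines, labeled along the top
        draw.line([(x, 0), (x, h)], fill=(255, 0, 0), width=1)
        draw.text((x + 2, 2), str(x), fill=(255, 0, 0))
    for y in range(step, h, step):  # horizontal lines; skip y=0 to avoid
        draw.line([(0, y), (w, y)], fill=(255, 0, 0), width=1)  # label clutter
        draw.text((2, y + 2), str(y), fill=(255, 0, 0))
    return out
```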
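And a sketch of how the Det-CoT control loop could orchestrate these prompts, following the decompose/diagnose/refine cycle described above. Everything here is a hypothetical stand-in: `call_mllm` abstracts a GPT-4V/Gemini API call assumed to return parsed dicts, and the prompt strings are illustrative, not the paper's.

```python
# Hypothetical Det-CoT loop: decompose -> visual prompt -> predict ->
# diagnose -> refine. Not the authors' implementation; `call_mllm` is an
# assumed wrapper that returns parsed dicts from the MLLM's response.
from typing import Callable, Dict, List

from PIL import Image

Box = List[float]  # [x1, y1, x2, y2] in image pixel coordinates

def det_cot(image: Image.Image,
            query: str,
            call_mllm: Callable[..., Dict],
            max_rounds: int = 3) -> List[Box]:
    # 1. Ask the MLLM to decompose the detection task into subtasks.
    plan = call_mllm(image=image,
                     prompt=f"Decompose this detection task into subtasks: {query}")
    boxes: List[Box] = []
    for subtask in plan["subtasks"]:
        # 2. Apply a visual processing prompt
        #    (overlay_measurement_grid from the previous sketch).
        prompted = overlay_measurement_grid(image)
        pred = call_mllm(image=prompted,
                         prompt=f"Locate '{subtask}'. Answer as [x1, y1, x2, y2].")
        # 3. Detection reasoning prompts: diagnose the box, refine if needed.
        for _ in range(max_rounds):
            check = call_mllm(image=prompted,
                              prompt=(f"Does the box {pred['box']} tightly enclose "
                                      f"'{subtask}'? If not, return a corrected box."))
            if check["accepted"]:
                break
            pred = check  # adopt the refined prediction and re-check
        boxes.append(pred["box"])
    return boxes
```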
**Conclusion:** DetToolChain shows that carefully designed visual processing and detection reasoning prompts, orchestrated by a detection Chain-of-Thought, can substantially improve the zero-shot detection ability of MLLMs such as GPT-4V and Gemini across open-vocabulary detection, described object detection, referring expression comprehension, and oriented object detection, all without fine-tuning.