Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models


29 May 2024 | Qiji Zhou, Ruochen Zhou, Zike Hu, Panzhong Lu, Siyang Gao, Yue Zhang
This paper introduces Image-of-Thought (IoT) prompting, a novel training-free approach designed to enhance multimodal large language models (MLLMs) on visual question-answering tasks. The method lets an MLLM engage directly with images through a step-by-step reasoning process, grounding its decisions in visual evidence rather than relying predominantly on textual interpretation. IoT prompting automatically generates paired textual and visual rationales, integrating discrete image-processing operations into the model's reasoning chain to form comprehensive multimodal rationales. This improves both accuracy and interpretability by reducing hallucinations and diminishing reliance on textual biases. The approach is MLLM-centric: the model itself guides the entire reasoning chain, yielding a more consistent and accurate multimodal reasoning framework.

The IoT method is evaluated on three benchmark datasets: MMBench, MME, and MMVet. Results show that IoT significantly improves performance across a range of tasks, particularly those involving reasoning and spatial awareness. On MMBench, IoT improves performance in several categories, including Physical Property Reasoning, Object Localization, and Spatial Relationship. On MME, IoT enhances cognitive tasks, with GPT-4o showing a 5.6% improvement and Gemini-Pro-1.5 a 23.9% improvement. On MMVet, IoT also strengthens OCR and mathematical reasoning. The method's effectiveness is attributed to its combination of textual and visual rationales, which provides a more holistic view of the reasoning process.

The method also has limitations, such as reduced performance in certain categories caused by its multi-image processing approach. Despite this, IoT offers a more integrated and accurate approach to multimodal reasoning, strengthening the model's ability to handle complex cognitive tasks. The method is designed to be flexible, allowing an appropriate tool to be selected at each step and improving adaptability across different multimodal scenarios.
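The core idea, interleaving discrete image-processing operations with textual rationales in a step-by-step chain, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's actual implementation: the operation names (`crop`, `threshold`), the plan format, and the 2D-list "image" are all hypothetical stand-ins, and the plan is hard-coded here where the real method would have the MLLM itself propose each step.

```python
# Sketch of an Image-of-Thought-style loop: each step applies a visual
# operation and records a paired textual rationale. The "planner" is a
# hard-coded stub standing in for the MLLM; ops and formats are illustrative.

def op_crop(img, top, left, h, w):
    """Return the h x w sub-region of a 2D pixel grid."""
    return [row[left:left + w] for row in img[top:top + h]]

def op_threshold(img, t):
    """Binarize pixels: 1 if value >= t, else 0."""
    return [[1 if px >= t else 0 for px in row] for row in img]

OPS = {"crop": op_crop, "threshold": op_threshold}

def iot_reason(image, plan):
    """Execute a step-by-step plan, pairing each image operation with a
    textual rationale to build a multimodal reasoning chain."""
    rationales = []
    current = image
    for step in plan:
        current = OPS[step["op"]](current, *step["args"])
        rationales.append({"text": step["why"], "visual": current})
    return current, rationales

if __name__ == "__main__":
    image = [[10, 200, 30],
             [40, 250, 60],
             [70, 220, 90]]
    plan = [
        {"op": "crop", "args": (0, 1, 3, 1), "why": "Focus on the bright column."},
        {"op": "threshold", "args": (128,), "why": "Isolate high-intensity pixels."},
    ]
    final, steps = iot_reason(image, plan)
    print(final)  # [[1], [1], [1]]
```

In the actual method, each intermediate visual result would be fed back to the MLLM, which decides the next operation; the collected (text, visual) pairs are the multimodal rationales that make the final answer more interpretable.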