Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models

29 May 2024 | Qiji Zhou, Ruochen Zhou, Zike Hu, Panzhong Lu, Siyang Gao, Yue Zhang
The paper introduces the Image-of-Thought (IoT) prompting method, a novel approach to enhance multimodal large language models (MLLMs) in visual question-answering tasks. IoT prompts MLLMs to extract visual rationales step-by-step, integrating both textual and visual information to improve accuracy and interpretability. The method involves a structured approach where the MLLM autonomously plans and executes a sequence of image processing actions, generating visual and textual rationales that align with each step of the reasoning process. This integration ensures that each decision is substantiated by direct image evidence, reducing reliance on textual interpretations and hallucinations. Experimental results on various benchmarks, including MMBench, MME, and MMVet, demonstrate significant improvements in zero-shot visual reasoning performance, particularly in cognitive tasks. The IoT method also enhances the model's ability to handle complex visual reasoning problems, making it a promising approach for improving the multimodal reasoning capabilities of MLLMs.
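The core loop described above, planning a sequence of image operations and pairing each intermediate image (the visual rationale) with a textual rationale, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's actual implementation: the image is a toy 2D list of pixel intensities, `plan_actions` is a stand-in for the MLLM's autonomous planning step, and all function names are hypothetical.

```python
# Hedged sketch of an Image-of-Thought (IoT) style reasoning loop.
# Assumptions: a toy image as a 2D list of intensities; a fixed action
# plan standing in for the MLLM's autonomous planning. All names here
# are illustrative, not the paper's API.

def plan_actions(question):
    # Stand-in for the MLLM planning which image operations to run
    # for this question; a real system would query the model here.
    return ["crop_center", "threshold"]

def crop_center(img):
    # Focus on the central region of the image.
    h, w = len(img), len(img[0])
    return [row[w // 4: 3 * w // 4] for row in img[h // 4: 3 * h // 4]]

def threshold(img, t=128):
    # Binarize to highlight bright evidence.
    return [[255 if p > t else 0 for p in row] for row in img]

ACTIONS = {"crop_center": crop_center, "threshold": threshold}

def iot_reasoning(image, question):
    """Execute the planned actions, pairing each intermediate image
    (visual rationale) with a textual rationale for that step."""
    rationales = []
    current = image
    for name in plan_actions(question):
        current = ACTIONS[name](current)
        rationales.append((name, current, f"Applied {name} to ground this step in image evidence"))
    return rationales

# Usage: an 8x8 synthetic image and a toy question.
image = [[10 * (r + c) % 256 for c in range(8)] for r in range(8)]
steps = iot_reasoning(image, "Is the bright region in the center?")
for name, visual, text in steps:
    print(f"{name}: {len(visual)}x{len(visual[0])} patch - {text}")
```

Each tuple in `steps` couples an action with the image it produced and a textual justification, mirroring how IoT grounds every reasoning step in direct visual evidence rather than text alone.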