Cantor: Inspiring Multimodal Chain-of-Thought of MLLM

24 Apr 2024 | Timin Gao, Peixian Chen, Mengdan Zhang, Chaoyou Fu, Yunhang Shen, Yan Zhang, Shengchuan Zhang, Xiawu Zheng, Xing Sun, Liujuan Cao, Rongrong Ji
The paper introduces Cantor, a novel multimodal chain-of-thought (CoT) framework designed to enhance the visual reasoning capabilities of multimodal large language models (MLLMs). Cantor addresses the limitations of existing multimodal CoT methods by integrating visual context with logical reasoning, leveraging the advanced cognitive functions of MLLMs. The framework consists of two stages: Decision-Generation and Execution. During Decision-Generation, Cantor processes visual and textual inputs to form a comprehensive understanding of the problem, integrating visual information to improve decision-making accuracy. The Execution stage carries out the sub-tasks assigned by the Decision-Generation stage using various expert modules, which are specialized MLLMs. These expert modules provide high-level information that enhances the overall reasoning process. Extensive experiments on two complex visual reasoning datasets, ScienceQA and MathVista, demonstrate the effectiveness of Cantor, showing significant accuracy improvements without requiring fine-tuning or ground-truth rationales. The paper also provides detailed analyses of the impact of visual information and the usage of expert modules, highlighting the importance of integrating visual context and leveraging the multimodal capabilities of MLLMs.
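To make the two-stage flow concrete, the sketch below mocks Cantor's Decision-Generation and Execution stages in Python. All names (`SubTask`, `decision_generation`, `execution`, the stub experts and their outputs) are hypothetical illustrations, not the paper's actual interfaces; in the real framework each stage would prompt an MLLM rather than return canned strings.

```python
# Hypothetical sketch of a two-stage decision/execution pipeline,
# loosely modeled on Cantor's description. Not the authors' code.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SubTask:
    expert: str  # which expert module should handle this sub-task
    prompt: str  # instruction passed to that expert


def decision_generation(question: str, image_caption: str) -> List[SubTask]:
    """Stage 1: integrate visual and textual inputs into a task plan.

    A real system would prompt an MLLM with the image and question;
    here we return a fixed plan for illustration.
    """
    return [
        SubTask("object_detector", f"List objects relevant to: {question}"),
        SubTask("attribute_reader", f"Describe attributes in: {image_caption}"),
    ]


def execution(sub_tasks: List[SubTask],
              experts: Dict[str, Callable[[str], str]]) -> str:
    """Stage 2: dispatch each sub-task to its expert module and
    aggregate the high-level findings into one rationale."""
    findings = [experts[task.expert](task.prompt) for task in sub_tasks]
    return " ".join(findings)


# Stub experts standing in for specialized MLLM calls.
experts: Dict[str, Callable[[str], str]] = {
    "object_detector": lambda p: "Objects: beaker, thermometer.",
    "attribute_reader": lambda p: "Attributes: the liquid is blue.",
}

plan = decision_generation("What is being measured?", "a lab scene")
rationale = execution(plan, experts)
print(rationale)
```

The key design point mirrored here is the separation of concerns: the decision stage only plans (producing sub-tasks), while the execution stage only gathers expert evidence, which a final answering step could then reason over.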