26 May 2024 | Qiguang Chen, Libo Qin, Jin Zhang, Zhi Chen, Xiao Xu, Wanxiang Che
M³CoT is a novel benchmark designed to address challenges in multi-domain, multi-step, and multi-modal chain-of-thought (CoT) reasoning. Existing multi-modal CoT benchmarks suffer from an absence of genuine visual reasoning, reliance on single-step reasoning, and limited domain diversity, all of which hinder progress in the field. M³CoT introduces a comprehensive dataset featuring diverse domains, multi-step reasoning, and true multi-modal interaction, and uses it to evaluate a range of Vision Large Language Models (VLLMs), highlighting the gap between current models and human performance.

The dataset is constructed through a rigorous annotation pipeline: removing samples that do not require multi-modal reasoning, constructing multi-step rationales, augmenting the mathematics and commonsense domains to broaden the range of reasoning tasks, and applying quality-assurance checks. This process yields a challenging, diverse resource intended to advance multi-modal reasoning research.

Results show that larger VLLMs perform better, but all evaluated models still lag well behind human performance. The study also compares different prompting strategies and fine-tuning approaches, revealing the importance of multi-modal interaction and complex reasoning steps for effective CoT reasoning, and exposing the limitations of existing models. M³CoT thus serves as a valuable resource for researchers aiming to improve multi-modal reasoning capabilities.
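To make the evaluation setup concrete, here is a minimal sketch of a zero-shot CoT evaluation loop over a multiple-choice, multi-modal benchmark like M³CoT. The HuggingFace dataset id `LightChen2333/M3CoT`, the field names (`question`, `choices`, `answer`, `image`), and the `query_vllm` helper are assumptions for illustration, not details confirmed by the paper.

```python
# A minimal sketch of zero-shot CoT evaluation on a multi-modal benchmark.
# Assumptions (not from the paper): the dataset id "LightChen2333/M3CoT",
# its field names, and the query_vllm() placeholder, which stands in for
# whichever VLLM API is under evaluation.
from datasets import load_dataset

COT_TRIGGER = "Let's think step by step."

def query_vllm(image, prompt: str) -> str:
    """Placeholder for a call to the VLLM being evaluated
    (e.g., a hosted vision endpoint or a local LLaVA model)."""
    raise NotImplementedError

def evaluate(split: str = "test", limit: int = 100) -> float:
    ds = load_dataset("LightChen2333/M3CoT", split=split)  # assumed dataset id
    n = min(limit, len(ds))
    correct = 0
    for sample in ds.select(range(n)):
        # Render answer options as (A), (B), ... so the model can pick a letter.
        options = "\n".join(
            f"({chr(65 + i)}) {choice}"
            for i, choice in enumerate(sample["choices"])
        )
        prompt = (
            f"{sample['question']}\n{options}\n{COT_TRIGGER}\n"
            "End your response with 'Answer: (X)'."
        )
        reply = query_vllm(sample["image"], prompt)
        # Extract the final letter choice from the model's free-form rationale.
        predicted = reply.rsplit("Answer:", 1)[-1].strip().strip("() .")[:1]
        gold = str(sample["answer"])  # assumed to be a letter such as "A"
        correct += predicted.upper() == gold.upper()
    return correct / n
```

Parsing the final answer out of a free-form rationale (rather than constraining the output) mirrors how CoT evaluations are commonly scored: the model is allowed to reason at length, and only the trailing `Answer: (X)` marker is used for accuracy.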