26 May 2024 | Qiguang Chen, Libo Qin, Jin Zhang, Zhi Chen, Xiao Xu, Wanxiang Che
The paper introduces M³CoT (Multi-Domain, Multi-Step, Multi-Modal Chain-of-Thought), a novel benchmark that addresses the limitations of existing benchmarks for multi-modal Chain-of-Thought (MCoT) reasoning. Existing benchmarks either omit visual modal reasoning entirely, restrict it to a single reasoning step, or cover too narrow a range of domains, which hinders the development of advanced MCoT models. M³CoT overcomes these shortcomings by incorporating more complex multi-modal reasoning scenarios that span multiple domains and require multiple reasoning steps.
The authors conduct a thorough evaluation of a range of Vision Large Language Models (VLLMs) on M³CoT, showing that although performance improves with larger parameter counts, the models still fall well short of human performance. The study also finds that fine-tuning on M³CoT yields substantially larger gains than vanilla in-context learning, tool usage, or prompting strategies.
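To make the evaluation setup concrete, here is a minimal sketch of how a zero-shot CoT-style prompt could be scored on an M³CoT-like multiple-choice sample. The `Sample` fields, prompt format, and `query_vllm` stub are illustrative assumptions, not the paper's actual evaluation harness.

```python
from dataclasses import dataclass
from typing import List
import re


@dataclass
class Sample:
    """One M3CoT-style multiple-choice item (fields are illustrative)."""
    question: str
    choices: List[str]
    answer: str        # gold option letter, e.g. "B"
    image_path: str    # path to the associated image


def build_cot_prompt(sample: Sample) -> str:
    """Build a zero-shot chain-of-thought prompt over the options."""
    options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(sample.choices))
    return (
        f"Question: {sample.question}\n"
        f"Options:\n{options}\n"
        "Let's think step by step, then answer with a single option letter."
    )


def extract_choice(response: str) -> str:
    """Take the last option letter mentioned in the model's rationale."""
    letters = re.findall(r"\b([A-D])\b", response)
    return letters[-1] if letters else ""


def query_vllm(prompt: str, image_path: str) -> str:
    """Placeholder for a real vision-language model call (API or local)."""
    raise NotImplementedError("Plug in an actual VLLM here.")


def accuracy(samples: List[Sample]) -> float:
    """Score exact-match accuracy of the extracted answers."""
    correct = sum(
        extract_choice(query_vllm(build_cot_prompt(s), s.image_path)) == s.answer
        for s in samples
    )
    return correct / len(samples) if samples else 0.0
```

Fine-tuning, in-context learning, and tool-augmented variants would differ only in how the model behind `query_vllm` is prepared; the scoring loop stays the same.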
Key contributions of the work include:
1. Identifying the weaknesses of current MCoT benchmarks.
2. Introducing M³CoT to address multi-domain, multi-step, and multi-modal reasoning.
3. Providing a comprehensive evaluation of MCoT approaches on M³CoT, offering insights for future research.
The paper also discusses the dataset annotation process, including sample removal, multi-step sample construction, domain augmentation, and quality assurance. The results show that VLLMs with more parameters perform better and that fine-tuning can significantly improve performance. The study concludes by emphasizing the need for more complex multi-modal interaction and higher rationale quality to improve performance on M³CoT.
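As a rough illustration of how such a curation pipeline might be organized, the sketch below chains the four stages named above (sample removal, multi-step construction, domain augmentation, quality assurance). The stage criteria and function names are placeholders for exposition, not the paper's actual annotation rules.

```python
from typing import Callable, Iterable, List

# Each stage maps a list of candidate samples to a (possibly smaller or
# rewritten) list. The concrete criteria below are illustrative placeholders.
Stage = Callable[[List[dict]], List[dict]]


def remove_unsuitable(samples: List[dict]) -> List[dict]:
    """Drop items whose answer does not genuinely require the image."""
    return [s for s in samples if s.get("requires_image", False)]


def build_multi_step(samples: List[dict]) -> List[dict]:
    """Keep items whose rationale spans several reasoning steps."""
    return [s for s in samples if len(s.get("rationale_steps", [])) >= 2]


def augment_domains(samples: List[dict]) -> List[dict]:
    """Placeholder for adding items from under-represented domains."""
    return samples  # e.g. generate and verify new science/commonsense items


def quality_check(samples: List[dict]) -> List[dict]:
    """Placeholder for human review of rationale and answer correctness."""
    return [s for s in samples if s.get("verified", True)]


def run_pipeline(samples: List[dict], stages: Iterable[Stage]) -> List[dict]:
    for stage in stages:
        samples = stage(samples)
    return samples


curated = run_pipeline(
    [],  # candidate samples would come from existing MCoT datasets
    [remove_unsuitable, build_multi_step, augment_domains, quality_check],
)
```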