26 May 2024 | Qiguang Chen, Libo Qin, Jin Zhang, Zhi Chen, Xiao Xu, Wanxiang Che
The paper introduces M³CoT (Multi-Domain, Multi-Step, Multi-Modal Chain-of-Thought), a novel benchmark that addresses the limitations of existing benchmarks for multi-modal Chain-of-Thought (MCoT) reasoning. Existing benchmarks either omit visual modal reasoning entirely, restrict it to a single reasoning step, or cover too narrow a range of domains, which hinders the development of advanced MCoT models. M³CoT overcomes these shortcomings by incorporating more complex multi-modal reasoning scenarios that span multiple domains and require multiple reasoning steps.
The authors conduct a thorough evaluation of a range of Vision Large Language Models (VLLMs) on M³CoT, showing that although performance improves with larger parameter counts, the models still fall well short of human performance. The study also finds that fine-tuning on M³CoT yields substantially larger gains than vanilla in-context learning, tool usage, or prompting strategies.
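To make the evaluation setup concrete, here is a minimal sketch of how a zero-shot CoT-style prompt could be scored on an M³CoT-like multiple-choice sample. The `Sample` fields, prompt format, and `query_vllm` stub are illustrative assumptions, not the paper's actual evaluation harness.

```python
from dataclasses import dataclass
from typing import List
import re


@dataclass
class Sample:
    """One M3CoT-style multiple-choice item (fields are illustrative)."""
    question: str
    choices: List[str]
    answer: str        # gold option letter, e.g. "B"
    image_path: str    # path to the associated image


def build_cot_prompt(sample: Sample) -> str:
    """Build a zero-shot chain-of-thought prompt over the options."""
    options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(sample.choices))
    return (
        f"Question: {sample.question}\n"
        f"Options:\n{options}\n"
        "Let's think step by step, then answer with a single option letter."
    )


def extract_choice(response: str) -> str:
    """Take the last option letter mentioned in the model's rationale."""
    letters = re.findall(r"\b([A-D])\b", response)
    return letters[-1] if letters else ""


def query_vllm(prompt: str, image_path: str) -> str:
    """Placeholder for a real vision-language model call (API or local)."""
    raise NotImplementedError("Plug in an actual VLLM here.")


def accuracy(samples: List[Sample]) -> float:
    """Score exact-match accuracy of the extracted answers."""
    correct = sum(
        extract_choice(query_vllm(build_cot_prompt(s), s.image_path)) == s.answer
        for s in samples
    )
    return correct / len(samples) if samples else 0.0
```

Fine-tuning, in-context learning, and tool-augmented variants would differ only in how the model behind `query_vllm` is prepared; the scoring loop stays the same.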
Key contributions of the work include:
1. Identifying the weaknesses of current MCoT benchmarks.
2. Introducing M³CoT to address multi-domain, multi-step, and multi-modal reasoning.
3. Providing a comprehensive evaluation of MCoT approaches on M³CoT, offering insights for future research.
The paper also discusses the dataset annotation process, including sample removal, multi-step sample construction, domain augmentation, and quality assurance. The results show that VLLMs with more parameters perform better and that fine-tuning can significantly improve performance. The study concludes by emphasizing the need for more complex multi-modal interaction and higher rationale quality to improve performance on M³CoT.
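As a rough illustration of how such a curation pipeline might be organized, the sketch below chains the four stages named above (sample removal, multi-step construction, domain augmentation, quality assurance). The stage criteria and function names are placeholders for exposition, not the paper's actual annotation rules.

```python
from typing import Callable, Iterable, List

# Each stage maps a list of candidate samples to a (possibly smaller or
# rewritten) list. The concrete criteria below are illustrative placeholders.
Stage = Callable[[List[dict]], List[dict]]


def remove_unsuitable(samples: List[dict]) -> List[dict]:
    """Drop items whose answer does not genuinely require the image."""
    return [s for s in samples if s.get("requires_image", False)]


def build_multi_step(samples: List[dict]) -> List[dict]:
    """Keep items whose rationale spans several reasoning steps."""
    return [s for s in samples if len(s.get("rationale_steps", [])) >= 2]


def augment_domains(samples: List[dict]) -> List[dict]:
    """Placeholder for adding items from under-represented domains."""
    return samples  # e.g. generate and verify new science/commonsense items


def quality_check(samples: List[dict]) -> List[dict]:
    """Placeholder for human review of rationale and answer correctness."""
    return [s for s in samples if s.get("verified", True)]


def run_pipeline(samples: List[dict], stages: Iterable[Stage]) -> List[dict]:
    for stage in stages:
        samples = stage(samples)
    return samples


curated = run_pipeline(
    [],  # candidate samples would come from existing MCoT datasets
    [remove_unsuitable, build_multi_step, augment_domains, quality_check],
)
```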