CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark

18 Mar 2024 | Ge Zhang, Xinrun Du, Bei Chen, Yiming Liang, Tongxu Luo, Tianyu Zheng, Kang Zhu, Yuyang Cheng, Chunpu Xu, Shuyue Guo, Haoran Zhang, Xingwei Qu, Junjie Wang, Ruibin Yuan, Yizhi Li, Zekun Wang, Yudong Liu, Yu-Hsuan Tsai, Fengji Zhang, Chenghua Lin, Wenhao Huang, Wenhu Chen, Jie Fu
**Introduction:** The paper introduces CMMMU, a new benchmark designed to evaluate large multimodal models (LMMs) on tasks requiring college-level subject knowledge and deliberate reasoning in a Chinese context. CMMMU includes 12,000 manually collected multimodal questions drawn from college exams, quizzes, and textbooks, covering six core disciplines and 30 subjects. The questions span 39 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures.

**Benchmark Details:**
- **Data Collection:** A three-stage process of manual collection, crowdsourcing, and supplementation ensures comprehensive, high-quality data.
- **Data Quality Control:** Strict protocols verify questions and filter out unqualified ones, ensuring that the questions are college-level and require expert knowledge.
- **Comparison with Existing Benchmarks:** CMMMU is compared with other multimodal benchmarks in terms of input image types, input format, question types, and knowledge depth.
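To make the data format concrete, here is a minimal sketch of what a CMMMU-style question record and a quality-control filter might look like. The field names (`question`, `options`, `answer`, `image_paths`, `subject`, `image_type`) and the specific checks are illustrative assumptions, not the paper's actual schema; in CMMMU the filtering is performed manually by annotators.

```python
from dataclasses import dataclass

# Hypothetical record layout for one multimodal question; the real
# CMMMU schema may differ (field names here are assumptions).
@dataclass
class CmmmuQuestion:
    question: str            # Chinese question text
    options: list[str]       # empty for non-multiple-choice items
    answer: str              # gold answer, e.g. "B"
    image_paths: list[str]   # one or more associated images
    subject: str             # e.g. "艺术与设计" (Art & Design)
    image_type: str          # one of the 39 image types, e.g. "chart"

def passes_quality_control(q: CmmmuQuestion) -> bool:
    """Illustrative filter mirroring the paper's protocol of discarding
    unqualified questions; the actual review is done by human experts."""
    if not q.question.strip() or not q.answer.strip():
        return False         # incomplete questions are discarded
    if not q.image_paths:
        return False         # every CMMMU question is multimodal
    if q.options and q.answer not in "ABCD"[: len(q.options)]:
        return False         # a multiple-choice answer must name a valid option
    return True
```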
**Evaluation:**
- **Model Evaluation:** The performance of various LMMs, both open-source and closed-source, is evaluated in a zero-shot setting to assess their raw ability to generate accurate answers on multimodal tasks.
- **Results:** GPT-4V achieves only 42% accuracy on CMMMU, indicating significant room for improvement. Even the strongest open-source models, such as Yi-VL-34B and Qwen-VL-Chat, trail the closed-source models, though Yi-VL-34B narrows the gap between open-source and closed-source models to 7%.

**Error Analysis:**
- **Error Types:** Perceptual errors, reasoning errors, lack of knowledge, and refusal to answer are identified as the primary causes of incorrect responses.
- **Analysis:** A detailed analysis of 150 examples of GPT-4V's incorrect answers reveals specific error patterns and areas for improvement.
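As a concrete illustration of the zero-shot protocol and the error bookkeeping described above, the sketch below scores model answers against gold labels and tallies annotated error categories. The `model.generate(prompt, images)` interface and the `error_category` field are hypothetical; CMMMU's official evaluation uses its own prompting and answer-extraction pipeline, and error categories come from manual annotation rather than automatic detection.

```python
from collections import Counter

# Error taxonomy from the paper's analysis of GPT-4V's mistakes.
ERROR_CATEGORIES = ("perceptual", "reasoning", "lack_of_knowledge", "refusal")

def evaluate_zero_shot(model, dataset):
    """Score a model on CMMMU-style items with no in-context examples.

    `model.generate(prompt, images)` is an assumed interface; a real
    harness would also extract the final answer from free-form output.
    """
    correct = 0
    error_tally = Counter()
    for item in dataset:
        prediction = model.generate(item["question"], item["image_paths"])
        if prediction.strip() == item["answer"]:
            correct += 1
        elif item.get("error_category") in ERROR_CATEGORIES:
            # Categories come from manual annotation, as in the paper's
            # 150-example study of GPT-4V's incorrect answers.
            error_tally[item["error_category"]] += 1
    return correct / len(dataset), error_tally
```

Zero-shot here means the model sees only the question and its images, with no worked examples in the prompt, so the score reflects the model's raw multimodal ability.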
**Conclusion:** CMMMU represents a significant step toward advancing the development of AGI, particularly in the Chinese context. It highlights the need for more sophisticated models capable of complex reasoning and understanding in non-English contexts. The benchmark aims to guide the development of bilingual LMMs and to promote their democratization across varied language contexts.