18 Mar 2024 | Ge Zhang, Xinrun Du, Bei Chen, Yiming Liang, Tongxu Luo, Tianyu Zheng, Kang Zhu, Yuyang Cheng, Chunpu Xu, Shuyue Guo, Haoran Zhang, Xingwei Qu, Junjie Wang, Ruibin Yuan, Yizhi Li, Zekun Wang, Yudong Liu, Yu-Hsuan Tsai, Fengji Zhang, Chenghua Lin, Wenhao Huang, Wenhu Chen, Jie Fu
CMMMU is a new Chinese Massive Multi-discipline Multimodal Understanding benchmark designed to evaluate large multimodal models (LMMs) on tasks requiring college-level subject knowledge and deliberate reasoning in a Chinese context. Inspired by the MMMU benchmark, CMMMU includes 12,000 manually collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines and 39 highly heterogeneous image types. The benchmark focuses on complex perception and reasoning with domain-specific knowledge in the Chinese context. An evaluation of 11 open-source models and the proprietary GPT-4V shows that even GPT-4V reaches only about 42% accuracy, indicating significant room for improvement. CMMMU aims to advance the development of next-generation LMMs toward expert AI and to support LMM democratization by offering varied language contexts.
CMMMU is one of the most comprehensive benchmarks for evaluating LMMs' complex reasoning and perception abilities. Each question is annotated with a detailed subfield and image type to investigate which kinds of questions are difficult for LMMs. A comprehensive error analysis of 150 samples that GPT-4V answers incorrectly reveals that even the most advanced LMMs still struggle with complex reasoning and understanding in a Chinese context. The benchmark also shows that the gap between open-source bilingual LMMs and closed-source LMMs is much smaller in a Chinese context than in English, as demonstrated on MMMU. For example, the most powerful open-source LMM, Yi-VL-34B, achieves an accuracy of 36%, a 7% gap to GPT-4V, whereas the corresponding gap in English is 11%.
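The per-category breakdown behind this kind of analysis can be reproduced with a simple grouping step. The sketch below is a hypothetical illustration, assuming each evaluated sample is recorded with its annotated image type and a correctness flag; the field names are assumptions for illustration, not the paper's own tooling.

```python
# Hypothetical sketch: group evaluation records by annotated image type and
# compute per-type accuracy to see which image types an LMM struggles with.
# The record fields ("image_type", "correct") are illustrative assumptions.
from collections import defaultdict

def accuracy_by_image_type(records: list[dict]) -> dict[str, float]:
    counts: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # [correct, total]
    for record in records:
        counts[record["image_type"]][0] += int(record["correct"])
        counts[record["image_type"]][1] += 1
    return {image_type: hits / total for image_type, (hits, total) in counts.items()}

# Example usage with two of CMMMU's 39 image types; sorting surfaces the
# hardest categories first.
records = [
    {"image_type": "chemical structures", "correct": False},
    {"image_type": "tables", "correct": True},
]
print(sorted(accuracy_by_image_type(records).items(), key=lambda kv: kv[1]))
```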
CMMMU provides a comprehensive evaluation of various models, including both LLMs and LMMs, covering closed-source and open-source implementations. The evaluation uses a zero-shot setting to examine the models' raw ability to generate accurate answers on multimodal tasks. The 12,000 questions are divided into a few-shot development set, a validation set, and a test set. The benchmark covers six disciplines, namely Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering, spanning over 30 subjects. Question types include multiple-choice, fill-in-the-blank, and true/false, and answering them requires thoughtful reasoning with university-level subject knowledge.
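As a concrete illustration of the zero-shot protocol described above, here is a minimal sketch of an evaluation loop for CMMMU-style multiple-choice questions. The prompt format, example field names ("question", "options", "answer", "image"), and the rule-based letter matching are assumptions for illustration, not the paper's official evaluation harness.

```python
# Minimal zero-shot evaluation sketch for CMMMU-style multiple-choice items.
# Assumed (illustrative) example fields: "question", "options", "answer", "image".
from typing import Callable

def build_prompt(example: dict) -> str:
    """Format one question with lettered options; no in-context examples (zero-shot)."""
    lines = [example["question"]]
    for letter, option in zip("ABCD", example["options"]):
        lines.append(f"({letter}) {option}")
    lines.append("Answer with the option letter only.")
    return "\n".join(lines)

def zero_shot_accuracy(examples: list[dict],
                       model: Callable[[str, object], str]) -> float:
    """model(prompt, image) returns a free-form reply; score by matching the gold letter."""
    correct = 0
    for example in examples:
        reply = model(build_prompt(example), example["image"]).strip().upper()
        # Simple rule-based extraction: take the first character of the reply.
        if reply[:1] == example["answer"].strip().upper():
            correct += 1
    return correct / len(examples)
```

Any vision-language model exposed as a `model(prompt, image)` callable can be plugged into this loop; real harnesses typically use more robust answer extraction than first-character matching.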
The results show that CMMMU is considerably more challenging than MMMU, which is itself already very challenging: GPT-4V achieves only 41.7% accuracy on CMMMU, compared with 55.7% in the English context. The disparity between representative open-source models and GPT-4V is also smaller in the Chinese context than on MMMU; in particular, the gap between Qwen-VL-Chat and GPT-4V on CMMMU is narrower than the corresponding gap in English.