2024-07-30 | Xuehai He, Weixi Feng, Kaizhi Zheng, Yujie Lu, Wanrong Zhu, Jiachen Li, Yue Fan, Jianfeng Wang, Linjie Li, Zhengyuan Yang, Kevin Lin, William Yang Wang, Lijuan Wang, Xin Eric Wang
MMWorld is a new benchmark for multi-discipline, multi-faceted multimodal video understanding, designed to evaluate how well Multimodal Large Language Models (MLLMs) reason about and interpret real-world dynamics. The benchmark spans seven broad disciplines, Art & Sports, Business, Science, Health & Medicine, Embodied Tasks, Tech & Engineering, and Games, and 69 subdisciplines. It comprises 1,910 videos with 6,627 human-annotated question-answer pairs and associated captions, plus a synthetic dataset for analyzing MLLMs' perception within single modalities. Evaluating 12 MLLMs, both open-source and proprietary, MMWorld reveals that even the best performer, GPT-4V, achieves only 52.3% accuracy. It also highlights differences in skill sets between MLLMs and humans, with models excelling in some areas while struggling in others. By covering multi-faceted reasoning, including explanation, counterfactual thinking, future prediction, and domain expertise, MMWorld is a significant step forward in evaluating MLLMs' understanding of complex video content.
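Since MMWorld reports model performance as multiple-choice accuracy, broken down by discipline, a minimal scoring sketch may help make the protocol concrete. The record fields (`qid`, `discipline`, `answer`) and sample data below are hypothetical placeholders, not the benchmark's actual release format.

```python
from collections import defaultdict

# Hypothetical MMWorld-style records: field names ("qid", "discipline",
# "answer") and values are illustrative assumptions, not the released schema.
annotations = [
    {"qid": "q001", "discipline": "Science", "answer": "B"},
    {"qid": "q002", "discipline": "Science", "answer": "D"},
    {"qid": "q003", "discipline": "Art & Sports", "answer": "C"},
]

# Model predictions keyed by question id (also hypothetical).
predictions = {"q001": "B", "q002": "A", "q003": "C"}

def accuracy_by_discipline(annotations, predictions):
    """Return overall and per-discipline multiple-choice accuracy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in annotations:
        d = item["discipline"]
        total[d] += 1
        if predictions.get(item["qid"]) == item["answer"]:
            correct[d] += 1
    per_discipline = {d: correct[d] / total[d] for d in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_discipline

overall, per_discipline = accuracy_by_discipline(annotations, predictions)
print(f"overall accuracy: {overall:.3f}")  # 0.667 on this toy data
for d, acc in sorted(per_discipline.items()):
    print(f"{d}: {acc:.3f}")
```

Per-discipline accuracy is what surfaces the skill-set differences the paper describes: a model can do well on one discipline while scoring far below the overall average on another.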