30 Jul 2024 | Xuehai He, Weixi Feng, Kaizhi Zheng, Yujie Lu, Wanrong Zhu, Jiachen Li, Yue Fan, Jianfeng Wang, Linjie Li, Zhengyuan Yang, Kevin Lin, William Yang Wang, Lijuan Wang, Xin Eric Wang
MMWorld is a new benchmark designed to evaluate the capabilities of Multimodal Large Language Models (MLLMs) in understanding and reasoning about complex real-world dynamics through video content. The benchmark covers seven broad disciplines and 69 subdisciplines, focusing on multi-faceted reasoning beyond perception, such as explanation, counterfactual thinking, future prediction, and domain expertise. MMWorld consists of a human-annotated dataset and a synthetic dataset. The human-annotated dataset includes 1,559 question-answer pairs and video captions, while the synthetic dataset is used to analyze MLLMs' perception within single modalities. The benchmark evaluates 12 MLLMs, including both open-source and proprietary models, and reveals significant challenges for existing models, with even the best performer achieving only 52.3% accuracy. The study also highlights differences in skill sets between MLLMs and humans, providing insights into their capabilities and limitations. MMWorld aims to serve as a crucial step towards comprehensive world model evaluation in videos.