30 Jul 2024 | Xuehai He, Weixi Feng, Kaizhi Zheng, Yujie Lu, Wanrong Zhu, Jiachen Li, Yue Fan, Jianfeng Wang, Linjie Li, Zhengyuan Yang, Kevin Lin, William Yang Wang, Lijuan Wang, Xin Eric Wang
MMWorld is a new benchmark designed to evaluate the capabilities of Multimodal Large Language Models (MLLMs) in understanding and reasoning about complex real-world dynamics through video content. The benchmark covers seven broad disciplines and 69 subdisciplines, focusing on multi-faceted reasoning beyond perception, such as explanation, counterfactual thinking, future prediction, and domain expertise. MMWorld consists of a human-annotated dataset and a synthetic dataset. The human-annotated dataset includes 1,559 question-answer pairs and video captions, while the synthetic dataset is used to analyze MLLMs' perception within single modalities. The benchmark evaluates 12 MLLMs, including both open-source and proprietary models, and reveals significant challenges for existing models, with even the best performer achieving only 52.3% accuracy. The study also highlights differences in skill sets between MLLMs and humans, providing insights into their capabilities and limitations. MMWorld aims to serve as a crucial step towards comprehensive world model evaluation in videos.