20 Jun 2024 | Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, Kai Chen
MMBench-Video is a long-form, multi-shot video understanding benchmark designed to evaluate the video comprehension capabilities of large vision-language models (LVLMs). It comprises approximately 600 web videos from YouTube spanning 16 major categories, each ranging from 30 seconds to 6 minutes in duration, together with around 2,000 original question-answer (QA) pairs covering 26 fine-grained capabilities. All questions are human-annotated according to a carefully constructed ability taxonomy and are crafted to probe models' temporal reasoning skills. Answers are scored automatically with GPT-4, which proves more accurate and robust than earlier LLM-based evaluation schemes. The evaluation code of MMBench-Video is integrated into VLMEvalKit: https://github.com/open-compass/VLMEvalKit.

MMBench-Video addresses the limitations of existing VideoQA benchmarks through longer videos, more diverse content, and a more comprehensive capability taxonomy. It also adopts a 3-grade marking scheme for evaluation that aligns better with human judgments, and an analysis of temporal indispensability shows that its questions depend on temporal information far more than those of other benchmarks, i.e., they generally cannot be answered from a single frame.

Using the benchmark, the authors evaluate a broad set of open-source and proprietary LVLMs across the capability taxonomy, revealing significant limitations in both spatial and temporal understanding. Existing Video-LLMs perform poorly on MMBench-Video, significantly underperforming proprietary LVLMs and even lagging behind open-source LVLMs. Further experiments show that incorporating speech information improves the state-of-the-art proprietary model, GPT-4o, and that GPT-4 serves as a stronger judge than GPT-3.5, providing more accurate and consistent evaluations. These results underscore the need for better video understanding capabilities, particularly temporal reasoning.
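As a concrete illustration of the GPT-4-based judging step described above, the sketch below shows a minimal LLM-as-judge scorer. The rubric wording, the 0/1/2 rendering of the 3-grade scheme, and the judge_answer helper are illustrative assumptions rather than the paper's exact prompt or the VLMEvalKit implementation.

```python
# Minimal LLM-as-judge sketch for grading open-ended VideoQA answers.
# Assumptions (not from the paper): the rubric wording, the 0/1/2 grade
# rendering, and the judge_answer() helper name are illustrative only.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

JUDGE_TEMPLATE = """You are grading a model's answer to a question about a video.
Question: {question}
Reference answer: {reference}
Model answer: {prediction}
Assign one grade: 0 (incorrect), 1 (partially correct), 2 (fully correct).
Reply with the grade only."""


def judge_answer(question: str, reference: str, prediction: str,
                 judge_model: str = "gpt-4") -> int:
    """Ask the judge model for a grade and parse it as an integer."""
    response = client.chat.completions.create(
        model=judge_model,
        temperature=0,  # deterministic grading
        messages=[{
            "role": "user",
            "content": JUDGE_TEMPLATE.format(
                question=question, reference=reference, prediction=prediction),
        }],
    )
    text = response.choices[0].message.content.strip()
    for token in text.split():
        if token.isdigit():
            return min(int(token), 2)  # clamp to the highest grade
    return 0  # conservative fallback when the reply cannot be parsed


# Example usage (hypothetical QA pair):
# grade = judge_answer("What does the presenter do after opening the box?",
#                      "She assembles the tripod.",
#                      "The presenter puts the tripod together.")
```

This sketch only covers the per-question judging step; the official prompt, aggregation, and end-to-end evaluation pipeline live in the VLMEvalKit repository linked above.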
Overall, MMBench-Video provides a valuable resource for the research community, facilitating improved evaluation of LVLMs and catalyzing progress in the field of video understanding.