20 Jun 2024 | Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, Kai Chen
MMBench-Video is a novel long-form, multi-shot VideoQA benchmark designed to evaluate the proficiency of Large Vision-Language Models (LVLMs) in understanding video content. The benchmark addresses the limitations of traditional VideoQA benchmarks by incorporating lengthy videos from YouTube, free-form questions, and a detailed capability taxonomy. MMBench-Video aims to rigorously assess models' temporal reasoning skills, with all questions human-annotated according to a carefully constructed ability taxonomy. The evaluation uses GPT-4 for automated scoring, demonstrating superior accuracy and robustness compared to earlier LLM-based evaluations. Comprehensive evaluations of both proprietary and open-source LVLMs for images and videos reveal significant performance disparities, highlighting the need for advancements in video LLMs. MMBench-Video provides valuable insights into the current limitations of Video-LLMs in spatial and temporal understanding, guiding future research and development. The evaluation code will be integrated into VLMEvalKit, a resource for the research community.
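To make the GPT-based automated scoring concrete, here is a minimal sketch of how an LLM judge could grade a free-form VideoQA answer against a reference. The prompt wording, 0–3 score scale, model name, and helper function are illustrative assumptions for this sketch, not the paper's actual implementation or the VLMEvalKit API.

```python
# Illustrative sketch of LLM-based grading for free-form VideoQA answers.
# Prompt wording, 0-3 scale, and model name are assumptions, not taken
# from the MMBench-Video paper or VLMEvalKit.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

GRADING_PROMPT = """You are grading a model's answer to a video question.
Question: {question}
Reference answer: {reference}
Model answer: {prediction}
Rate the model answer from 0 (wrong) to 3 (fully correct). Reply with the number only."""


def grade_answer(question: str, reference: str, prediction: str) -> int:
    """Ask an LLM judge to score a free-form prediction against the reference."""
    response = client.chat.completions.create(
        model="gpt-4-0125-preview",
        messages=[{
            "role": "user",
            "content": GRADING_PROMPT.format(
                question=question, reference=reference, prediction=prediction),
        }],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())


# Example usage
score = grade_answer(
    question="What does the person do after opening the fridge?",
    reference="They take out a carton of milk.",
    prediction="The person grabs some milk from the fridge.",
)
print(score)
```

In practice, such a judge would be run over every question–prediction pair in the benchmark and the per-question scores aggregated along the capability taxonomy; the paper reports that this GPT-4-based scheme is more accurate and robust than earlier LLM-based evaluation protocols.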