MLVU: A Comprehensive Benchmark for Multi-Task Long Video Understanding

19 Jun 2024 | Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, Zheng Liu
MLVU is a comprehensive benchmark for multi-task long video understanding (LVU), designed to evaluate how well multimodal large language models (MLLMs) understand long videos. The benchmark comprises 2,593 evaluation tasks across nine categories, spanning a wide range of video genres and diverse LVU tasks. Video lengths extend substantially beyond prior benchmarks, ranging from 3 minutes to over 2 hours, and the videos include varied types such as movies, documentaries, surveillance footage, and game videos. The tasks fall into three groups, holistic LVU, single-detail LVU, and multi-detail LVU, each probing a different aspect of long-video comprehension. An evaluation of 20 MLLMs shows that long-video understanding remains a technically challenging problem for current models: GPT-4o performs best, yet all models struggle on most tasks, pointing to needed improvements in context length, image understanding, and the underlying LLM backbones. By offering a detailed and comprehensive assessment of MLLMs' LVU capabilities over diverse and genuinely long videos, MLVU aims to advance research in long video understanding.
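To make the evaluation setup concrete, the sketch below shows a minimal per-category accuracy loop over multiple-choice tasks of the kind MLVU contains. It is an illustration only: the file name `mlvu_tasks.json`, its field names, and the `model.answer` interface are hypothetical assumptions, not the official MLVU data format or evaluation protocol.

```python
# Minimal sketch of a multiple-choice evaluation loop for MLVU-style tasks.
# All names (mlvu_tasks.json, its fields, model.answer) are hypothetical;
# see the official MLVU release for the actual format and protocol.
import json
from collections import defaultdict

def evaluate(model, task_file="mlvu_tasks.json"):
    """Compute per-category accuracy for a model that maps
    (video_path, question, options) -> chosen option letter."""
    with open(task_file) as f:
        tasks = json.load(f)

    correct = defaultdict(int)
    total = defaultdict(int)
    for t in tasks:
        pred = model.answer(t["video"], t["question"], t["options"])
        total[t["category"]] += 1
        if pred == t["answer"]:
            correct[t["category"]] += 1

    # Report accuracy separately for each task category (e.g., holistic,
    # single-detail, multi-detail), as MLVU scores models per task type.
    return {cat: correct[cat] / total[cat] for cat in total}
```

Reporting accuracy per category rather than a single aggregate score mirrors the paper's emphasis that holistic, single-detail, and multi-detail tasks stress different capabilities of a model.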