MLVU: A Comprehensive Benchmark for Multi-Task Long Video Understanding

19 Jun 2024 | Junjie Zhou1,3*, Yan Shu1*, Bo Zhao1*, Boya Wu1, Shitao Xiao1, Xi Yang1, Yongping Xiong3, Bo Zhang4, Tiejun Huang1,2, Zheng Liu1*
The paper introduces MLVU (Multi-task Long Video Understanding Benchmark), a comprehensive benchmark designed to evaluate the long video understanding (LVU) capabilities of multimodal large language models (MLLMs). MLVU addresses the limitations of existing benchmarks by extending video lengths, covering a diverse range of video genres, and developing a variety of evaluation tasks. The benchmark features videos ranging from 3 minutes to 2 hours in length and spans genres such as movies, documentaries, surveillance footage, and cartoons. It comprises nine task categories: topic reasoning, anomaly recognition, video summarization, needle question answering, ego reasoning, plot question answering, sub-scene captioning, action order, and action count. An empirical study of 20 MLLMs reveals significant room for improvement: all methods struggle with most tasks, and performance degrades as videos grow longer. The analysis highlights the importance of context length, image-understanding quality, and the choice of LLM backbone for advancing LVU capabilities. MLVU is expected to advance research in long video understanding by providing a comprehensive and in-depth evaluation of MLLMs.
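To make the multi-task setup concrete, the sketch below shows one way a multiple-choice item from such a benchmark could be represented and scored per task. This is not the official MLVU data format or evaluation code; the field names, task identifiers, and the simple per-task accuracy metric are illustrative assumptions only.

```python
# Hypothetical sketch of a multi-task long-video benchmark item and per-task scoring.
# Not the official MLVU loader or metric; names and schema are assumptions.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MLVUItem:
    video_path: str      # long video, roughly 3 minutes to 2 hours
    task: str            # e.g. "topic_reasoning", "needle_qa", "action_count"
    question: str
    options: List[str]   # candidate answers for multiple-choice tasks
    answer_index: int    # index of the correct option


def accuracy_per_task(items: List[MLVUItem], predictions: List[int]) -> Dict[str, float]:
    """Compute accuracy separately for each task category from predicted option indices."""
    correct: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for item, pred in zip(items, predictions):
        total[item.task] = total.get(item.task, 0) + 1
        if pred == item.answer_index:
            correct[item.task] = correct.get(item.task, 0) + 1
    return {task: correct.get(task, 0) / n for task, n in total.items()}


# Toy usage example:
items = [
    MLVUItem("movie_001.mp4", "plot_qa", "Why does the protagonist leave the city?",
             ["A", "B", "C", "D"], 2),
    MLVUItem("doc_017.mp4", "topic_reasoning", "What is the documentary mainly about?",
             ["A", "B", "C", "D"], 0),
]
print(accuracy_per_task(items, predictions=[2, 1]))
# {'plot_qa': 1.0, 'topic_reasoning': 0.0}
```

Generation-style tasks in the benchmark (e.g., video summarization and sub-scene captioning) would need a separate scoring path rather than option matching; the sketch covers only the multiple-choice case.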