LVBench: An Extreme Long Video Understanding Benchmark

2024-06-12 | Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Shiyu Huang, Bin Xu, Yuxiao Dong, Ming Ding, Jie Tang
LVBench is a benchmark designed to evaluate how well multimodal models understand long videos, which typically run several hours. It addresses the gap between current models' capabilities and the demands of long-video tasks such as embodied intelligence, in-depth movie reviews, and live sports commentary. The dataset consists of publicly sourced long videos spanning six major categories and 21 subcategories, with an average duration of 4,101 seconds, and is annotated through a combination of manual effort and model assistance to ensure quality and reliability.

The benchmark defines six core capabilities for long video understanding: temporal grounding, abstractive reasoning, entity recognition, event-based reasoning, detail-oriented reasoning, and key information retrieval. These capabilities are combined to build complex, challenging questions that enable a comprehensive evaluation of a model's ability to process and comprehend lengthy video content.

Experiments cover eight models, four without native long-video support and four with it: GPT-4o, TimeChat, PLLaVA, LLaVA-NeXT, LLaMA-VID, MovieChat, LWM, and Gemini 1.5 Pro. The results show that while state-of-the-art models have made progress on short video understanding, their performance on long videos still falls short of human-level accuracy. Gemini 1.5 Pro achieved the best overall performance, outperforming other models on multiple tasks, though some models without native long-video support still achieved competitive results. An analysis of video and clue durations finds that most models perform well when the clue segment lasts under 10 seconds or longer than 60 seconds. The distribution of generated answers further reveals that existing video understanding models struggle to follow instructions precisely, with some producing responses outside the specified options. LVBench aims to spur the development of more advanced models capable of tackling the complexities of long video comprehension. The dataset and code are publicly available at https://lvbench.github.io/.
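Because LVBench poses multiple-choice questions and the abstract notes that some models answer outside the specified options, a small scoring sketch helps make the evaluation concrete. The snippet below is not the official LVBench code: the record fields (`question`, `options`, `answer`) and the letter-extraction heuristic are assumptions for illustration only.

```python
import re

# Hypothetical LVBench-style records: real items come from the dataset at
# https://lvbench.github.io/, and these field names are assumptions.
samples = [
    {
        "question": "What happens right after the goal is scored?",
        "options": {
            "A": "The crowd cheers",
            "B": "The match ends",
            "C": "A penalty is awarded",
            "D": "Play is paused",
        },
        "answer": "A",
    },
]


def extract_choice(response: str):
    """Recover a single option letter (A-D) from a free-form model response.

    Returns None when no letter is found, i.e. the "response outside the
    specified options" failure mode described above. This is a simple
    heuristic, not the official LVBench answer parser.
    """
    match = re.search(r"\b([A-D])\b", response.strip().upper())
    return match.group(1) if match else None


def score(samples, responses):
    """Compute multiple-choice accuracy and the rate of unparseable answers."""
    correct = invalid = 0
    for sample, response in zip(samples, responses):
        choice = extract_choice(response)
        if choice is None:
            invalid += 1
        elif choice == sample["answer"]:
            correct += 1
    n = len(samples)
    return {"accuracy": correct / n, "invalid_rate": invalid / n}


# Example: one reply that follows the instruction and answers correctly.
print(score(samples, ["The answer is (A): the crowd cheers."]))
```

Tracking the invalid-answer rate separately from accuracy, as sketched here, makes the instruction-following failures discussed in the abstract visible rather than silently counting them as wrong answers.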