LVBench: An Extreme Long Video Understanding Benchmark


2024-06-12 | Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Shiyu Huang, Bin Xu, Yuxiao Dong, Ming Ding, Jie Tang
LVBench is a benchmark for long video understanding, addressing the gap in current multimodal models' ability to comprehend long videos. It comprises 103 high-quality long videos, each at least 30 minutes long, paired with 1,549 question-answer pairs. The videos span six major categories and 21 subcategories and focus on tasks that require long-term memory and extended comprehension. The benchmark evaluates six core capabilities: temporal grounding, summarization, reasoning, entity recognition, event understanding, and key information retrieval, with many questions designed to test complex combinations of these skills. The videos were curated through a multi-stage filtering process to ensure high-quality, diverse content, and the benchmark is publicly available at https://lvbench.github.io/. Experiments show that current multimodal models still underperform on long video understanding, and the benchmark aims to spur the development of models capable of handling the complexities of long video comprehension.
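As a rough illustration of how annotations like these might be consumed, the sketch below loads a local JSON file of question-answer entries and tallies how many questions exercise each core capability. The file name and field names ("question_type" as a list of capability tags) are assumptions for illustration; consult https://lvbench.github.io/ for the actual schema of the released files.

```python
# Minimal sketch of inspecting LVBench-style annotations (schema assumed, not official).
import json
from collections import Counter


def load_annotations(path: str) -> list[dict]:
    """Load question-answer entries from a local JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def capability_distribution(entries: list[dict]) -> Counter:
    """Count questions per core capability (a question may test several)."""
    counts = Counter()
    for entry in entries:
        # "question_type" is assumed to be a list of capability tags such as
        # "temporal grounding" or "key information retrieval".
        for tag in entry.get("question_type", []):
            counts[tag] += 1
    return counts


if __name__ == "__main__":
    entries = load_annotations("lvbench_qa.json")  # hypothetical local path
    print(f"{len(entries)} question-answer pairs")
    for tag, n in capability_distribution(entries).most_common():
        print(f"{tag}: {n}")
```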
The videos have an average duration of 4,101 seconds, providing a robust testbed for extended temporal contexts. The benchmark reports a detailed evaluation of models such as GPT-4o, TimeChat, PLLaVA, and LLaVA-NeXT, as well as models that natively support long videos. While some models perform reasonably well, others struggle with long videos, indicating the need for further progress in long video understanding. The benchmark also analyzes how video length and clue length affect performance, showing that both can significantly influence results. The dataset comes with a detailed description of its composition, collection process, and usage; it is freely accessible, hosted on HuggingFace under the CC-BY-NC-SA-4.0 license, and is expected to be updated as needed to correct labeling errors and add new instances. Overall, the benchmark provides a comprehensive framework for assessing multimodal models on complex long video understanding tasks.
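To make the length analysis concrete, here is a hedged sketch of a multiple-choice evaluation loop that buckets accuracy by video duration. The `predict_choice` callable is a placeholder for any multimodal model wrapper, and the entry fields ("duration", "video_path", "candidates", "answer_index") are assumptions, not LVBench's exact evaluation protocol.

```python
# Sketch: accuracy per video-duration bucket for multiple-choice QA (assumed fields).
from collections import defaultdict
from typing import Callable


def evaluate(
    entries: list[dict],
    predict_choice: Callable[[str, str, list[str]], int],
    duration_buckets: tuple[int, ...] = (3600, 7200),  # seconds; illustrative cutoffs
) -> dict[str, float]:
    """Return accuracy per video-duration bucket."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for entry in entries:
        # Assign the entry to the first bucket whose threshold it fits under.
        bucket = next(
            (f"<= {b}s" for b in duration_buckets if entry["duration"] <= b),
            f"> {duration_buckets[-1]}s",
        )
        # predict_choice(video_path, question, candidate_answers) -> chosen index
        pred = predict_choice(entry["video_path"], entry["question"], entry["candidates"])
        total[bucket] += 1
        correct[bucket] += int(pred == entry["answer_index"])
    return {b: correct[b] / total[b] for b in total}
```

Grouping results this way mirrors the paper's observation that performance varies with video and clue length, while leaving the model-specific details (frame sampling rate, prompt format) to the wrapper behind `predict_choice`.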