17 Jun 2024 | Yunxin Li, Xinyu Chen, Baotian Hu, Longyue Wang, Haoyuan Shi, Min Zhang
VideoVista is a comprehensive video QA benchmark designed to evaluate video large multimodal models (Video-LMMs) on video understanding and reasoning. The dataset comprises 3,402 videos across 14 categories, with durations ranging from a few seconds to over 10 minutes, and roughly 25,000 questions spanning 27 task types. It was constructed with an automatic data-generation framework that combines GPT-4o with advanced video analysis tools for video splitting, object segmentation, and tracking.
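To make the construction pipeline concrete, here is a minimal sketch of the general idea: cut a long video into clips, then prompt GPT-4o to turn a per-clip description into a multiple-choice question. The paper's actual framework uses its own video-analysis tooling and prompts; the function names, the fixed-length splitting, and the plain-text clip description placeholder below are assumptions for illustration.

```python
# Hypothetical sketch of the automatic annotation idea: split a long video
# into clips with ffmpeg, then ask GPT-4o to turn a per-clip description
# into a multiple-choice question. In the paper, clip descriptions come
# from dedicated video-analysis tools (splitting, segmentation, tracking);
# here a plain text string stands in for them.
import subprocess
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def split_video(path: str, clip_seconds: int = 60) -> None:
    """Cut `path` into fixed-length clips (clip_000.mp4, ...) via ffmpeg."""
    subprocess.run(
        ["ffmpeg", "-i", path, "-c", "copy", "-map", "0",
         "-segment_time", str(clip_seconds), "-f", "segment",
         "clip_%03d.mp4"],
        check=True,
    )

def generate_question(clip_description: str) -> str:
    """Ask GPT-4o for one four-option multiple-choice question on a clip."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                "Write one multiple-choice question (options A-D, and mark "
                "the correct answer) that tests understanding of this video "
                "clip:\n" + clip_description
            ),
        }],
    )
    return response.choices[0].message.content
```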
The benchmark comprises 19 understanding task types, such as anomaly detection and interaction understanding, and 8 reasoning task types, including logical and causal reasoning. Evaluating 10 state-of-the-art Video-LMMs, the study finds that they struggle with long videos and with fine-grained tasks such as temporal localization and anomaly detection, and that their logical and relational reasoning lags behind commercial models like GPT-4o and Gemini-1.5. Open-source Video-LMMs in particular perform significantly worse, underscoring how much room remains to improve video understanding and reasoning capabilities.
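The reported comparisons boil down to per-task accuracy over multiple-choice answers. A minimal sketch of that breakdown, assuming a hypothetical predictions file in JSONL with `task`, `answer`, and `prediction` fields (not the benchmark's actual release format):

```python
# Per-task accuracy over a JSONL file of model predictions. The schema
# (fields `task`, `answer`, `prediction`) is hypothetical, used only to
# illustrate the kind of breakdown the evaluation reports.
import json
from collections import defaultdict

def per_task_accuracy(results_path: str) -> dict[str, float]:
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    with open(results_path) as f:
        for line in f:
            item = json.loads(line)
            total[item["task"]] += 1
            if item["prediction"].strip().upper() == item["answer"].strip().upper():
                correct[item["task"]] += 1
    return {task: correct[task] / total[task] for task in total}

print(per_task_accuracy("videovista_predictions.jsonl"))
```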
The VideoVista dataset thus provides a diverse and challenging benchmark covering a wide range of video content, durations, and tasks, from understanding tasks such as action recognition, event description, and object tracking to reasoning tasks such as relation and causal reasoning. It is designed to assess the comprehensive abilities of Video-LMMs, including their capacity to handle long videos and complex reasoning. The study's findings pinpoint key limitations of current models, notably ineffective long-video processing and weak reasoning, and underscore VideoVista's role in advancing Video-LMMs that can accurately understand and reason about videos.
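Since degradation on long videos is the headline failure mode, a natural follow-up analysis is to bucket accuracy by source-video duration. A hedged sketch, reusing the hypothetical JSONL schema above plus an assumed `duration` field in seconds:

```python
# Accuracy bucketed by source-video duration, to probe the long-video
# weakness the study reports. The `duration` field (in seconds) is an
# assumed addition to the hypothetical prediction schema above.
import json
from collections import defaultdict

BUCKETS = [(0, 60, "<1 min"), (60, 120, "1-2 min"),
           (120, 600, "2-10 min"), (600, float("inf"), ">10 min")]

def accuracy_by_duration(results_path: str) -> dict[str, float]:
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    with open(results_path) as f:
        for line in f:
            item = json.loads(line)
            for lo, hi, label in BUCKETS:
                if lo <= item["duration"] < hi:
                    total[label] += 1
                    correct[label] += item["prediction"] == item["answer"]
                    break
    return {label: correct[label] / total[label] for label in total}

print(accuracy_by_duration("videovista_predictions.jsonl"))
```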