17 Jun 2024 | Yunxin Li, Xinyu Chen, Baotian Hu, Longyue Wang, Haoyuan Shi, Min Zhang
**VideoVista: A Versatile Benchmark for Video Understanding and Reasoning**
The paper introduces VideoVista, a comprehensive video QA benchmark designed to evaluate the performance of large multimodal models (LMMs) in video understanding and reasoning. VideoVista includes 25,000 questions derived from 3,400 videos across 14 categories, with durations ranging from a few seconds to over 10 minutes. The benchmark covers 19 types of understanding tasks and 8 types of reasoning tasks, such as anomaly detection and logical reasoning.
To construct VideoVista, the authors propose an automatic data construction framework that leverages GPT-4o together with advanced video analysis tools for video splitting, object segmentation, and tracking. The same framework also generates training data to enhance the capabilities of video-related LMMs (Video-LMMs).
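The construction pipeline described above can be sketched in miniature. This is a hypothetical illustration, not the paper's actual code: `split_into_clips` and `build_qa_prompt` are assumed helper names, and the prompt wording is illustrative of how a clip description might be handed to GPT-4o for QA generation.

```python
# Hypothetical sketch of the video-splitting and QA-prompting steps of an
# automatic data construction pipeline (helper names are assumptions, not
# the paper's actual implementation).

def split_into_clips(duration_s: float, clip_len_s: float = 60.0):
    """Split a video of `duration_s` seconds into (start, end) time ranges."""
    clips = []
    start = 0.0
    while start < duration_s:
        end = min(start + clip_len_s, duration_s)
        clips.append((start, end))
        start = end
    return clips

def build_qa_prompt(clip_caption: str) -> str:
    """Assemble a prompt asking an LLM (e.g. GPT-4o) to write a QA pair
    grounded in a clip's caption; the exact wording is illustrative."""
    return (
        "Given this video clip description, write one multiple-choice "
        "question with four options and mark the correct answer.\n\n"
        f"Clip: {clip_caption}"
    )

# Example: a 150-second video yields three clip ranges.
print(split_into_clips(150.0))  # [(0.0, 60.0), (60.0, 120.0), (120.0, 150.0)]
```

In the full pipeline, each clip would also pass through segmentation and tracking tools so that object-level questions (e.g. about relations between tracked entities) can be generated alongside clip-level ones.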
Experiments on cutting-edge Video-LMMs reveal that these models struggle with fine-grained video tasks and with logical and relational reasoning, and that open-source models lag significantly behind commercial models such as GPT-4o and Gemini-1.5. The paper highlights the importance of VideoVista in advancing LMMs toward accurate video understanding and precise reasoning.
**Contributions:**
1. **Dataset Construction:** VideoVista is a versatile video QA benchmark with diverse content categories, varying durations, and comprehensive tasks.
2. **Automatic Data Construction:** An efficient framework for creating large-scale training and evaluation datasets using GPT-4o and advanced video analysis methods.
3. **Performance Analysis:** Identifies key shortcomings of current Video-LMMs in understanding, reasoning, and comprehensive abilities, providing insights for future improvements.
**Future Work:**
1. **Improving Long Video Encoding:** Enhancing long-context processing and optimizing video frame downsampling.
2. **Integrating Multi-Modal Information:** Leveraging audio information to improve performance in certain categories.
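The frame-downsampling direction mentioned under future work can be illustrated with a minimal sketch. Uniform sampling is a common baseline strategy; the paper's exact scheme is not specified here, so the function below is an assumption for illustration.

```python
# Minimal sketch of uniform frame downsampling for long-video encoding
# (a common baseline; the paper's actual sampling scheme may differ).

def downsample_indices(num_frames: int, budget: int):
    """Pick `budget` frame indices spread evenly across `num_frames`."""
    if num_frames <= budget:
        return list(range(num_frames))
    step = num_frames / budget
    return [int(i * step) for i in range(budget)]

# A 919-second video at 30 fps has 27,570 frames; a 32-frame budget
# keeps roughly one frame every 29 seconds.
idx = downsample_indices(27_570, 32)
print(len(idx), idx[:3])  # 32 [0, 861, 1723]
```

The tension this exposes is exactly the benchmark's finding: with so sparse a sample, fine-grained events lasting a few seconds can fall entirely between kept frames, which motivates better long-context encoders rather than ever-coarser sampling.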
**Limitations:**
1. **Video Duration:** The dataset's maximum video length of 919 seconds does not cover longer content like movies.
2. **Annotation Errors:** GPT-4o may hallucinate when given insufficient visual context, which can introduce errors into the annotations.
**Conclusion:**
VideoVista provides a robust framework for assessing and enhancing the capabilities of Video-LMMs, addressing the need for a versatile benchmark in video understanding and reasoning.