16 Jun 2024 | Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Rongrong Ji, Xing Sun
Video-MME is the first comprehensive benchmark designed to evaluate Multi-Modal Large Language Models (MLLMs) in video analysis. It addresses the limitations of existing benchmarks by incorporating diverse video types, varying temporal durations, and multiple data modalities, including subtitles and audio. The benchmark features 900 videos with a total duration of 254 hours, manually annotated with 2,700 question-answer pairs. The evaluation includes both commercial and open-source models, such as Gemini 1.5 Pro, GPT-4V, and InternVL-Chat-V1.5. Key findings include:
1. **Diversity and Generalizability**: Video-MME covers 6 primary visual domains and 30 subfields, ensuring broad scenario generalizability.
2. **Temporal Dynamics**: Videos range from 11 seconds to 1 hour, evaluating the adaptability of MLLMs across different temporal contexts.
3. **Multi-Modal Inputs**: The benchmark integrates video frames with subtitles and audio, enabling assessment of how additional modalities affect video understanding (see the sketch after this list).
4. **Quality and Annotations**: Rigorous manual labeling by expert annotators ensures precise and reliable model assessment.
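To make the benchmark's structure concrete, the sketch below shows one plausible way to represent a Video-MME-style entry and to sample frames uniformly before querying a model. The field names, the `sample_frame_indices` helper, and the duration labels are illustrative assumptions, not the official Video-MME schema or evaluation code.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical record layout for a Video-MME-style entry; the real
# benchmark's schema may differ. Each video carries multiple
# multiple-choice question-answer pairs (2,700 in total across 900 videos).
@dataclass
class VideoMMEEntry:
    video_id: str
    domain: str                 # one of 6 primary visual domains
    subfield: str               # one of 30 subfields
    duration_category: str      # "short", "medium", or "long"
    duration_seconds: float
    subtitles: str              # optional subtitle track, "" if unused
    question: str
    options: List[str]          # four answer choices, "A"-"D"
    answer: str                 # ground-truth letter, e.g. "C"

def sample_frame_indices(num_total_frames: int, num_samples: int) -> List[int]:
    """Uniformly sample frame indices across the whole video, so that
    short clips and hour-long videos use the same frame budget."""
    if num_total_frames <= num_samples:
        return list(range(num_total_frames))
    step = num_total_frames / num_samples
    return [int(i * step) for i in range(num_samples)]

# Example: a 1-hour video at 30 fps reduced to a 16-frame context.
print(sample_frame_indices(num_total_frames=3600 * 30, num_samples=16))
```

Uniform sampling like this is one common way to cap the visual context length; models with longer context windows can afford denser sampling, which is part of why performance varies with video duration.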
The results show that Gemini 1.5 Pro outperforms the other evaluated models, achieving an average accuracy of 75%. Open-source models, while showing potential, still lag significantly behind their commercial counterparts, with VILA-1.5 achieving 59% accuracy. Integrating subtitles and audio substantially improves video comprehension, especially for longer videos. However, performance declines as video duration increases, highlighting the need for better handling of long sequences and multi-modal data. The benchmark also suggests that image understanding is foundational for video understanding, as demonstrated by the strong performance of image-based MLLMs on Video-MME. Future research should focus on improving long-context modeling capabilities and on building datasets that require complex temporal understanding.
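The accuracy figures above are simple fractions of correctly answered multiple-choice questions, typically broken down by video duration and by whether subtitles were provided. The following is a minimal scoring sketch under those assumptions; the prediction format and group names are hypothetical, not the official evaluation script.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def accuracy_by_group(predictions: List[Tuple[str, str, str]]) -> Dict[str, float]:
    """Compute per-group and overall accuracy.

    Each prediction is (group, predicted_letter, ground_truth_letter),
    where group might be a duration bucket ("short"/"medium"/"long")
    or a modality condition ("frames_only"/"frames+subtitles").
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, pred, gold in predictions:
        total[group] += 1
        correct[group] += int(pred.strip().upper() == gold.strip().upper())
    scores = {g: correct[g] / total[g] for g in total}
    scores["overall"] = sum(correct.values()) / sum(total.values())
    return scores

# Toy example: accuracy drops as duration grows, mirroring the trend
# reported on Video-MME (these numbers are made up for illustration).
preds = [
    ("short", "A", "A"), ("short", "B", "B"),
    ("medium", "C", "C"), ("medium", "D", "A"),
    ("long", "B", "C"), ("long", "A", "A"),
]
print(accuracy_by_group(preds))
```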