16 Jun 2024 | Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiaowu Zheng, Enhong Chen, Rongrong Ji, Xing Sun
Video-MME is the first comprehensive benchmark for evaluating multi-modal large language models (MLLMs) in video analysis. It includes 900 videos, 2,700 high-quality multiple-choice questions, and diverse video types across six domains with 30 sub-domains. The videos vary in duration from 11 seconds to 1 hour, and include subtitles and audio tracks to enhance multi-modal understanding. The dataset is manually curated and annotated by experts to ensure quality and diversity.
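To make the dataset structure concrete, the sketch below shows one plausible layout for a single Video-MME question record. The field and value names (e.g. `video_id`, `duration_category`, `subtitle_path`, the example domain labels) are illustrative assumptions rather than the benchmark's published schema:

```python
# Hypothetical layout of one Video-MME multiple-choice item.
# Field names and example values are assumptions for illustration only.
from dataclasses import dataclass
from typing import List


@dataclass
class VideoMMEItem:
    video_id: str           # which of the 900 videos the question refers to
    domain: str             # one of the 6 primary domains
    sub_domain: str         # one of the 30 sub-domains
    duration_category: str  # coarse length bucket (videos span 11 s to 1 h)
    subtitle_path: str      # accompanying subtitle track
    audio_path: str         # accompanying audio track
    question: str           # expert-annotated multiple-choice question
    options: List[str]      # candidate answers, e.g. "A." through "D."
    answer: str             # ground-truth option letter


item = VideoMMEItem(
    video_id="example_0001",
    domain="Knowledge",
    sub_domain="Science",
    duration_category="short",
    subtitle_path="subs/example_0001.srt",
    audio_path="audio/example_0001.wav",
    question="What instrument is introduced at the start of the video?",
    options=["A. Piano", "B. Violin", "C. Drums", "D. Guitar"],
    answer="B",
)
```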
The benchmark evaluates various state-of-the-art MLLMs, including commercial models like Gemini 1.5 Pro and open-source models like InternVL-Chat-V1.5 and LLaVA-NeXT-Video. Results show that Gemini 1.5 Pro performs best, achieving an average accuracy of 75%, significantly outperforming open-source models. The benchmark also highlights the importance of subtitles and audio in video understanding, and shows that performance declines as video duration increases for all models.
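Because every question is multiple-choice, scoring reduces to exact-match accuracy over predicted option letters, which also makes the per-duration breakdown straightforward. The following is a minimal sketch of such a scorer, assuming records shaped like the hypothetical `VideoMMEItem` above and a `predictions` mapping supplied by the model under test:

```python
# Minimal multiple-choice scorer: overall accuracy plus a per-duration
# breakdown. `items` is a list of VideoMMEItem-like records; `predictions`
# maps each item's index to a predicted option letter such as "B".
from collections import defaultdict


def score(items, predictions):
    correct = defaultdict(int)
    total = defaultdict(int)
    for idx, item in enumerate(items):
        total[item.duration_category] += 1
        if predictions.get(idx, "") == item.answer:
            correct[item.duration_category] += 1
    overall = sum(correct.values()) / max(sum(total.values()), 1)
    per_duration = {k: correct[k] / total[k] for k in total}
    return overall, per_duration
```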
The findings highlight the need for MLLMs to better handle long sequences and multi-modal data, and provide concrete directions for future model development. Applicable to both image-based and video-based MLLMs, Video-MME offers researchers a comprehensive evaluation framework, with detailed analyses of performance across video durations and modalities that expose the remaining challenges of video understanding.