20 Jun 2024 | Yifan Du, Kun Zhou, Yuqi Huo, Yifan Li, Wayne Xin Zhao, Haoyu Lu, Zijia Zhao, Bingning Wang, Weipeng Chen, Ji-Rong Wen
This paper introduces Event-Bench, an event-oriented long video understanding benchmark that evaluates video Multimodal Large Language Models (video MLLMs) across three levels of event understanding: atomic, composite, and overall. Event-Bench comprises six event-related tasks and 2,190 test instances to comprehensively assess video event understanding ability. To address the scarcity of human-annotated, event-intensive data, we propose Video Instruction Merging (VIM), a cost-effective method that enhances video MLLMs using merged, event-intensive video instructions. Extensive experiments show that the best-performing model, GPT-4o, achieves an overall accuracy of 53.33%, significantly outperforming the best open-source model by 41.42%. VIM surpasses both state-of-the-art open-source models and GPT-4V on Event-Bench. All code, data, and models are publicly available at https://github.com/RUCAIBox/Event-Bench.
Event-Bench is designed to evaluate video MLLMs at three levels of event understanding: atomic, composite, and overall, spanning six event-related tasks and 2,190 test instances. To construct the benchmark, we design an automatic pipeline that collects unbiased test instances from existing datasets, unifies their formats, and filters out low-quality ones. Additionally, we manually craft test instances from event-intensive long videos on YouTube to cover complex real-world scenarios. Compared with prior benchmarks, Event-Bench distinguishes itself through longer time scopes and an event-oriented focus.
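To make the construction pipeline above concrete, here is a minimal Python sketch, assuming a unified multiple-choice schema and a text-only filtering heuristic. The schema fields, helper names, and the particular quality filter are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of an automatic benchmark-construction pipeline in the spirit of
# the description above: collect instances from existing datasets, unify their
# format, and filter low-quality ones. The unified schema, function names, and
# the text-only filtering heuristic are assumptions.
from dataclasses import dataclass


@dataclass
class TestInstance:
    video_path: str      # path or URL of the source video
    question: str        # event-oriented question
    options: list[str]   # multiple-choice options
    answer: str          # ground-truth option label, e.g. "B"
    task: str            # one of the six event-related tasks


def unify(raw: dict, task: str) -> TestInstance:
    """Map a raw record from a source dataset into the unified schema (assumed field names)."""
    return TestInstance(
        video_path=raw["video"],
        question=raw["question"],
        options=raw["options"],
        answer=raw["answer"],
        task=task,
    )


def is_low_quality(inst: TestInstance, text_only_model) -> bool:
    """One plausible filter: drop instances a text-only model answers correctly
    without seeing the video, since they do not actually test video understanding."""
    prediction = text_only_model.answer(inst.question, inst.options)
    return prediction == inst.answer


def build_benchmark(sources: dict[str, list[dict]], text_only_model) -> list[TestInstance]:
    """sources maps each task name to its raw records; returns the filtered, unified instances."""
    instances = [unify(raw, task) for task, raws in sources.items() for raw in raws]
    return [inst for inst in instances if not is_low_quality(inst, text_only_model)]
```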
Eliciting human-like video understanding requires training video MLLMs on large amounts of event-intensive video instructions, but annotating such data is costly. To address this, we compose more complex training data from existing image and video instructions. Specifically, we first employ an adaptive model architecture that handles both image and video inputs, allowing us to incorporate high-quality image instructions into the training process. Second, we propose Video Instruction Merging (VIM), which merges similar videos from existing datasets into a new video containing all the events of the originals. Extensive experiments on Event-Bench demonstrate that our method outperforms all open-source models of comparable parameter scale and even surpasses GPT-4V on average.
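The following hedged Python sketch shows one way such merging could work: group videos by caption similarity and concatenate each group into a single multi-event training sample. The caption-embedding similarity, the greedy grouping, and the `embed` / `concat_videos` helpers are illustrative assumptions rather than the paper's exact procedure.

```python
# Hedged sketch of Video Instruction Merging (VIM) as described above: group
# similar videos and merge each group into one longer, event-intensive sample.
# Similarity via caption embeddings and the greedy grouping are assumptions.
import numpy as np


def merge_instructions(samples: list[dict], group_size: int = 3,
                       embed=None, concat_videos=None) -> list[dict]:
    """samples: dicts with "video" (path), "caption" (str), and "qa" (list of QA pairs).
    embed(text) -> 1-D np.ndarray and concat_videos(paths) -> path are assumed helpers."""
    # Embed captions and compute pairwise cosine similarity between videos.
    vectors = np.stack([embed(s["caption"]) for s in samples])
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
    similarity = vectors @ vectors.T

    merged, used = [], set()
    for i in range(len(samples)):
        if i in used:
            continue
        # Greedily take the most similar not-yet-merged videos (including video i itself).
        group_ids = [j for j in np.argsort(-similarity[i]) if j not in used][:group_size]
        used.update(group_ids)
        group = [samples[j] for j in group_ids]
        merged.append({
            # One longer video containing all events from the group.
            "video": concat_videos([g["video"] for g in group]),
            # Keep every original instruction so the merged sample stays event-intensive.
            "qa": [qa for g in group for qa in g["qa"]],
        })
    return merged
```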
Our main contributions are: (1) We propose an event-oriented long video benchmark, Event-Bench, to evaluate human-like video understanding capability; (2) We devise VIM, a low-cost method that improves video MLLMs using merged, event-intensive video instructions and high-quality image instructions; (3) Experimental results demonstrate the comprehensive evaluation capability of Event-Bench for video MLLMs and the effectiveness of VIM.