Towards Event-oriented Long Video Understanding


20 Jun 2024 | Yifan Du, Kun Zhou, Yuqi Huo, Yifan Li, Wayne Xin Zhao, Haoyu Lu, Zijia Zhao, Bingning Wang, Weipeng Chen, Ji-Rong Wen
The paper introduces Event-Bench, an event-oriented long video understanding benchmark designed to evaluate the human-like video understanding capabilities of video Large Language Models (video LLMs). Event-Bench comprises six event-related tasks and 2,190 test instances, covering atomic, composite, and overall event understanding. To address the lack of event-rich data in existing datasets, the authors propose Video Instruction Merging (VIM), a cost-effective method that enhances video LLMs with merged, event-intensive video instructions. Extensive experiments show that the best-performing model, GPT-4o, achieves an overall accuracy of 53.33%, significantly outperforming the best open-source model by 41.42%. The paper also discusses the effectiveness of VIM and the limitations of the benchmark, such as the need for more diverse and complex event data.