LongVLM: Efficient Long Video Understanding via Large Language Models

20 Jul 2024 | Yuetian Weng, Mingfei Han, Haoyu He, Xiaojun Chang, and Bohan Zhuang
LongVLM is a VideoLLM designed for efficient long video understanding. To capture fine-grained detail in long-term videos, it decomposes each video into short-term segments and encodes local features for each segment with a hierarchical token merging module. These features are concatenated in temporal order to preserve the storyline across segments, and global semantics are integrated into each local feature to strengthen context understanding, enabling the LLM to generate comprehensive responses for long-term videos. Experimental results on the VideoChatGPT benchmark and zero-shot video question-answering datasets show that LongVLM outperforms previous state-of-the-art methods in accuracy and response quality. By capturing both local and global information, the model gives precise responses to questions about long-term video content while keeping computational costs affordable.
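The pipeline described above (segment the video, merge tokens within each segment, concatenate in temporal order, add global context) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The function names, the greedy adjacent-pair merging rule, the mean-pooled global token, and the choice to prepend that token to the sequence are all assumptions made for illustration; the summary does not specify how the hierarchical token merging module or the global integration is actually built.

```python
from math import sqrt


def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def split_into_segments(frames, num_segments):
    """Split frames into at most `num_segments` contiguous chunks, in temporal order."""
    size = -(-len(frames) // num_segments)  # ceiling division
    return [frames[i:i + size] for i in range(0, len(frames), size)]


def merge_tokens(tokens, target):
    """Assumed stand-in for token merging: repeatedly average the most
    similar adjacent pair until `target` tokens remain."""
    tokens = [list(t) for t in tokens]
    while len(tokens) > max(target, 1):
        i = max(range(len(tokens) - 1),
                key=lambda k: cosine(tokens[k], tokens[k + 1]))
        tokens[i:i + 2] = [mean_vector([tokens[i], tokens[i + 1]])]
    return tokens


def encode_video(frames, num_segments, tokens_per_segment):
    """frames: list of frames, each a list of patch-token vectors.

    Returns one global token followed by the merged local tokens of each
    segment, concatenated in temporal order.
    """
    all_tokens = [tok for frame in frames for tok in frame]
    # Assumption: global semantics approximated by the mean of all tokens.
    global_token = mean_vector(all_tokens)
    local_tokens = []
    for segment in split_into_segments(frames, num_segments):
        segment_tokens = [tok for frame in segment for tok in frame]
        local_tokens.extend(merge_tokens(segment_tokens, tokens_per_segment))
    # Assumption: prepending lets a causal LLM attend to the global token
    # from every local token that follows it.
    return [global_token] + local_tokens
```

With 8 frames of 4 tokens each, `encode_video(frames, 4, 2)` returns 9 tokens instead of the original 32: one global token plus 2 merged tokens for each of the 4 segments. The sequence length passed to the LLM therefore depends on the number of segments and the per-segment token budget, not on the raw frame count, which is the sense in which such a design keeps computational cost affordable.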