**LongVLM: Efficient Long Video Understanding via Large Language Models**
**Authors:** Yuetian Weng, Mingfei Han, Haoyu He, Xiaojun Chang, Bohan Zhuang
**Institutions:** ZIP Lab, Monash University; School of Information Science and Technology, University of Science and Technology of China; Department of Computer Vision, MBZUAI; ReLER, AAII, UTS
**Abstract:**
Recent advancements in Video-based Large Language Models (VideoLLMs) have significantly improved video understanding tasks. However, existing VideoLLMs struggle with detailed understanding of long-term videos because they typically compress the entire video into a compact global representation, losing fine-grained local information. To address this, LongVLM introduces a simple yet effective approach that decomposes a long video into multiple short-term segments and encodes local features for each segment using a hierarchical token merging module. The segment features are concatenated in temporal order to preserve the storyline, and global semantics are integrated into each local feature to enhance contextual understanding. Experimental results on the VideoChatGPT benchmark and zero-shot video question-answering datasets demonstrate that LongVLM outperforms previous state-of-the-art methods, producing more precise and accurate responses for long-term video understanding.
**Contributions:**
1. **Proposed Model:** LongVLM, a simple yet effective VideoLLM for efficient long-term video understanding.
2. **Decomposition and Aggregation:** LongVLM decomposes long videos into short-term segments and extracts local features for each segment, preserving their temporal order.
3. **Global Semantics Integration:** Global semantic information is integrated into local features to enhance context understanding.
**Methods:**
- **Local Feature Aggregation:** A hierarchical token merging module aggregates the visual tokens within each short-term segment, reducing the number of tokens passed to the LLM and thus the computational cost.
- **Global Semantics Integration:** The [CLS] tokens from the video frames are averaged and concatenated with the local segment features to enrich the context (see the sketch after this list).
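To make these two steps concrete, below is a minimal PyTorch sketch. It is not the authors' implementation: the greedy pairwise-averaging merge is a simplified stand-in for the paper's hierarchical token merging module, and the function names (`merge_tokens`, `build_video_tokens`) and tensor sizes are illustrative assumptions.

```python
# Minimal sketch of segment-wise token merging + global [CLS] integration.
# Assumptions (not from the paper): function names, tensor sizes, and the
# greedy pairwise-averaging merge, which stands in for the paper's
# hierarchical token merging module.
import torch
import torch.nn.functional as F


def merge_tokens(tokens: torch.Tensor, keep: int) -> torch.Tensor:
    """Reduce (N, D) patch tokens to (keep, D) by repeatedly averaging the
    most similar neighbouring pair (a simplified, similarity-based merge)."""
    while tokens.size(0) > keep:
        sim = F.cosine_similarity(tokens[:-1], tokens[1:], dim=-1)   # (N-1,)
        i = int(sim.argmax())                                        # most similar adjacent pair
        merged = 0.5 * (tokens[i] + tokens[i + 1])                   # average the pair
        tokens = torch.cat([tokens[:i], merged[None], tokens[i + 2:]], dim=0)
    return tokens


def build_video_tokens(frame_feats: torch.Tensor,
                       num_segments: int,
                       tokens_per_segment: int) -> torch.Tensor:
    """frame_feats: (T, 1 + P, D) per-frame encoder outputs; index 0 is [CLS].

    Returns (1 + num_segments * tokens_per_segment, D): the averaged global
    [CLS] token followed by per-segment local features in temporal order.
    """
    cls_tokens, patch_tokens = frame_feats[:, 0], frame_feats[:, 1:]  # (T, D), (T, P, D)
    global_token = cls_tokens.mean(dim=0, keepdim=True)               # (1, D) global semantics

    segments = patch_tokens.chunk(num_segments, dim=0)                # split frames into segments
    local = [merge_tokens(seg.flatten(0, 1), tokens_per_segment)      # aggregate each segment
             for seg in segments]
    local = torch.cat(local, dim=0)                                   # keep temporal order

    return torch.cat([global_token, local], dim=0)


# Toy example: 16 frames, 64 patch tokens each, 256-dim features,
# compressed to 4 segments x 8 tokens + 1 global token = 33 tokens.
feats = torch.randn(16, 65, 256)
video_tokens = build_video_tokens(feats, num_segments=4, tokens_per_segment=8)
print(video_tokens.shape)  # torch.Size([33, 256])
```

In a full pipeline, a token sequence like this would typically be passed through a projection layer into the LLM's embedding space alongside the text prompt; the sketch only illustrates how segment-level aggregation and the averaged [CLS] token could be combined while preserving temporal order.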
**Experiments:**
- **Evaluation on VideoChatGPT Benchmark:** LongVLM outperforms other models in detail orientation and consistency.
- **Zero-Shot Video Question-Answering:** LongVLM achieves the highest accuracy and quality scores on multiple datasets.
**Conclusion:**
LongVLM effectively captures detailed information in long-term videos, providing consistent and accurate responses. Future work may explore extending the model to video-centric multimodal generation tasks and training it on larger-scale, longer-duration videos.