Streaming Long Video Understanding with Large Language Models


25 May 2024 | Rui Qian, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Shuangrui Ding, Dahua Lin, Jiaqi Wang
This paper introduces VideoStreaming, an advanced vision-language large model (VLLM) designed for understanding long videos. The primary challenge in long video understanding is the computational burden caused by the large number of tokens extracted from the video. Previous methods often rely on sparse sampling or frame compression, which can lose information or discard temporal structure. To address these limitations, VideoStreaming employs two core designs, sketched in code after the list below:

1. **Memory-Propagated Streaming Encoding**: This architecture segments long videos into short clips and sequentially encodes each clip together with a propagated memory. Each iteration fuses the encoded result of the preceding clip, carried as historical memory, with the current clip, distilling a condensed representation that captures the video content up to the current timestamp. This design incorporates long-term temporal dynamics and yields a fixed-length memory for arbitrarily long videos.
2. **Adaptive Memory Selection**: After encoding, the model selects a constant number of question-related memories from all historical memories and feeds them into the LLM to generate informative responses. This reduces redundancy and enables efficient, precise video understanding.
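The two designs fit together as: encode clip by clip while carrying a fixed-length memory forward, then hand only the question-relevant memories to the LLM. Below is a minimal, self-contained sketch of that flow. The module sizes, the toy Transformer used as the condenser, and the use of cosine similarity to score question relevance are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn


class StreamingEncoder(nn.Module):
    """Encode a long video clip by clip, propagating a fixed-length memory."""

    def __init__(self, dim=256, mem_tokens=16):
        super().__init__()
        self.mem_tokens = mem_tokens
        # Learnable initial memory; each step distills a new memory of the same length.
        self.memory_init = nn.Parameter(torch.zeros(mem_tokens, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.condenser = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, clips):
        """clips: (num_clips, tokens_per_clip, dim) pre-extracted visual features."""
        memory = self.memory_init.unsqueeze(0)               # (1, M, dim)
        histories = []
        for clip in clips:                                   # sequential, clip by clip
            tokens = torch.cat([memory, clip.unsqueeze(0)], dim=1)
            fused = self.condenser(tokens)                   # fuse memory + current clip
            memory = fused[:, : self.mem_tokens]             # condensed summary so far
            histories.append(memory.squeeze(0))
        return torch.stack(histories)                        # (num_clips, M, dim)


def select_memories(histories, question_emb, k=4):
    """Adaptive memory selection: keep the k historical memories most related
    to the question (scored here by cosine similarity on mean-pooled memories)."""
    pooled = histories.mean(dim=1)                           # (num_clips, dim)
    scores = torch.cosine_similarity(pooled, question_emb.unsqueeze(0), dim=-1)
    idx = scores.topk(min(k, scores.numel())).indices.sort().values  # keep temporal order
    return histories[idx]                                    # (k, M, dim) goes to the LLM


# Toy usage: 12 clips of 64 visual tokens each, plus a random "question" embedding.
clips = torch.randn(12, 64, 256)
question = torch.randn(256)
histories = StreamingEncoder()(clips)
selected = select_memories(histories, question, k=4)
print(histories.shape, selected.shape)  # torch.Size([12, 16, 256]) torch.Size([4, 16, 256])
```

Because the number of selected memories is constant regardless of video length, the LLM's input budget stays fixed no matter how long the video is.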
Training follows a two-stage progressive approach: in the first stage, a small language model is trained on single-clip encoding tasks; in the second stage, it is jointly trained with the LLM for long video understanding (see the sketch at the end of this summary). Extensive experiments demonstrate that VideoStreaming achieves superior performance and higher efficiency on long-video benchmarks, showcasing precise temporal comprehension and detailed question-answering capabilities.
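For completeness, the two-stage progressive training can be viewed as a parameter-freezing schedule. The sketch below uses hypothetical module handles (`condenser` for the small language model that performs streaming encoding, `llm` for the large model) and assumes stage 1 keeps the LLM frozen; the paper's actual recipe may differ in detail.

```python
import torch.nn as nn


def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag


def configure_stage(stage: int, condenser: nn.Module, llm: nn.Module) -> None:
    if stage == 1:
        # Stage 1: train the small language model (the condenser that performs
        # streaming encoding) on single-clip encoding tasks; the LLM stays frozen.
        set_trainable(condenser, True)
        set_trainable(llm, False)
    else:
        # Stage 2: jointly train the condenser together with the LLM
        # on long video understanding data.
        set_trainable(condenser, True)
        set_trainable(llm, True)


# Placeholder modules stand in for the real components.
condenser, llm = nn.Linear(8, 8), nn.Linear(8, 8)
configure_stage(1, condenser, llm)
print(any(p.requires_grad for p in llm.parameters()))  # False during stage 1
```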