Streaming Long Video Understanding with Large Language Models

25 May 2024 | Rui Qian, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Shuangrui Ding, Dahua Lin, Jiaqi Wang
This paper introduces VideoStreaming, a vision-language large model (VLLM) for video understanding that can process videos of arbitrary length by streaming encoding and adaptive selection of video tokens. The central difficulty in long video understanding is the high computational cost of processing many frames; previous methods have coped with this through sparse sampling or frame compression, often sacrificing spatial detail or temporal information.

VideoStreaming addresses these limitations with two core designs: Memory-Propagated Streaming Encoding and Adaptive Memory Selection. The former segments a long video into short clips and encodes each clip sequentially with a propagated memory, integrating the historical memory with the current clip to distill a condensed representation. The latter selects a constant number of question-related memories from all historical memories, enabling efficient and precise reasoning. This disentangling of video encoding from reasoning allows the LLM to answer different questions about the same video by directly selecting the corresponding memories, without re-encoding the entire video.

Extensive experiments show that VideoStreaming achieves superior performance and higher efficiency on long video benchmarks, with precise temporal comprehension for detailed question answering. The model is trained with a two-stage progressive training scheme and a long-video data construction strategy, and is evaluated on a range of long video QA datasets, where it shows strong accuracy, temporal understanding, and inference efficiency. Together, the streaming encoding and adaptive memory selection architecture make VideoStreaming a promising approach to long video understanding with large language models.
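To make the two mechanisms concrete, below is a minimal, self-contained PyTorch sketch. It is only an illustration under assumed tensor shapes and a generic cross-attention condenser, not the authors' implementation: the names StreamingMemoryEncoder and select_memories, the 16-token memory size, and the use of mean pooling plus cosine similarity for question relevance are all hypothetical choices made for the example.

```python
# Minimal sketch of memory-propagated streaming encoding and adaptive memory
# selection, under assumed shapes; not the paper's actual implementation.
import torch
import torch.nn.functional as F


class StreamingMemoryEncoder(torch.nn.Module):
    """Condense each clip into a fixed-size memory that is propagated forward."""

    def __init__(self, dim=256, mem_tokens=16, heads=4):
        super().__init__()
        # Learnable memory queries that attend over (previous memory + clip features).
        self.query = torch.nn.Parameter(torch.randn(mem_tokens, dim))
        self.attn = torch.nn.MultiheadAttention(dim, heads, batch_first=True)

    def encode_clip(self, clip_feats, prev_memory=None):
        # clip_feats: (1, tokens, dim) features of the current short clip.
        # prev_memory: (1, mem_tokens, dim) condensed history, or None for the first clip.
        context = clip_feats if prev_memory is None else torch.cat(
            [prev_memory, clip_feats], dim=1)
        q = self.query.unsqueeze(0)                    # (1, mem_tokens, dim)
        memory, _ = self.attn(q, context, context)     # distill a condensed memory
        return memory


def select_memories(memories, question_emb, k=4):
    """Keep a constant number k of clip memories most relevant to the question."""
    # memories: (num_clips, mem_tokens, dim); question_emb: (dim,)
    pooled = memories.mean(dim=1)                                     # (num_clips, dim)
    scores = F.cosine_similarity(pooled, question_emb[None], dim=-1)  # relevance per clip
    top = scores.topk(min(k, memories.shape[0])).indices
    top = top.sort().values                                           # keep temporal order
    return memories[top]


if __name__ == "__main__":
    dim = 256
    enc = StreamingMemoryEncoder(dim=dim)
    memories, prev = [], None
    for _ in range(8):                        # stream 8 short clips of a long video
        clip = torch.randn(1, 64, dim)        # hypothetical per-clip visual features
        prev = enc.encode_clip(clip, prev)
        memories.append(prev.squeeze(0))
    question_emb = torch.randn(dim)           # hypothetical question embedding
    selected = select_memories(torch.stack(memories), question_emb, k=4)
    print(selected.shape)                     # (4, 16, 256) -> tokens passed to the LLM
```

Because each clip is condensed into a fixed-size memory, the per-clip encoding cost stays constant regardless of video length, and answering a new question only requires re-running the selection step over the cached memories rather than re-encoding the video.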