Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams

30 Jun 2024 | Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jifeng Dai, Jiashi Feng, Xiaojie Jin
**Introduction:** Online video streaming is a prevalent media format with broad applications, such as robotics and surveillance systems. However, existing large video-language models struggle with real-time question answering over long videos because their VRAM consumption and inference latency grow with video length. This paper introduces Flash-VStream, a video-language model inspired by human memory mechanisms, designed to process extremely long video streams in real time while simultaneously responding to user queries.

**Model Overview:** Flash-VStream employs a Spatial-Temporal-Abstract-Retrieved (STAR) memory mechanism to compress and update visual information on the fly. The model consists of a streaming visual encoder, the STAR memory, and a large language model (LLM) decoder. The encoder continuously embeds incoming video frames, the STAR memory maintains a compact, continuously updated representation of what has been seen, and the LLM decoder answers user queries in real time by reading that memory rather than the raw frames.
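To make the streaming design concrete, below is a minimal, hedged sketch of a STAR-style memory in Python. Everything here is an illustrative assumption rather than the paper's implementation: the class name `STARMemory`, the buffer capacities, the greedy nearest-centroid merge (a simplified stand-in for whatever clustering the paper uses for its temporal memory), and the retrieval rule. What it illustrates is structural: per-frame updates stay cheap, and the context handed to the LLM decoder remains fixed-size no matter how long the stream runs.

```python
# Illustrative sketch only: names, capacities, and update rules are
# assumptions, not Flash-VStream's actual implementation.
from collections import deque

import numpy as np


class STARMemory:
    def __init__(self, d=256, spatial_cap=4, temporal_cap=16,
                 keyframe_cap=256, retrieved_cap=3):
        self.spatial = deque(maxlen=spatial_cap)     # newest frames, most detail
        self.centroids = np.zeros((0, d))            # temporal memory: cluster centers
        self.weights = np.zeros(0)                   # frames merged into each cluster
        self.temporal_cap = temporal_cap
        self.abstract = np.zeros(d)                  # running mean of the whole stream
        self.n_seen = 0
        self.keyframes = deque(maxlen=keyframe_cap)  # candidate pool for retrieval
        self.retrieved_cap = retrieved_cap

    def update(self, feat):
        """Fold one frame feature vector (shape (d,)) into all four memories."""
        self.spatial.append(feat)
        self.keyframes.append(feat)
        self.n_seen += 1
        # Abstract memory: incremental mean, so it never grows with the stream.
        self.abstract += (feat - self.abstract) / self.n_seen
        # Temporal memory: merge into the nearest centroid once capacity is hit
        # (a greedy stand-in for a proper streaming clustering scheme).
        if len(self.centroids) < self.temporal_cap:
            self.centroids = np.vstack([self.centroids, feat])
            self.weights = np.append(self.weights, 1.0)
        else:
            i = int(np.argmin(np.linalg.norm(self.centroids - feat, axis=1)))
            w = self.weights[i]
            self.centroids[i] = (w * self.centroids[i] + feat) / (w + 1.0)
            self.weights[i] = w + 1.0

    def read(self):
        """Return a fixed-size context for the LLM decoder."""
        # Retrieved memory: for the heaviest clusters, fetch the closest raw
        # keyframe feature, recovering detail that averaging blurred away.
        top = np.argsort(-self.weights)[: self.retrieved_cap]
        retrieved = []
        for i in top:
            dists = [np.linalg.norm(k - self.centroids[i]) for k in self.keyframes]
            retrieved.append(self.keyframes[int(np.argmin(dists))])
        return list(self.spatial), self.centroids, self.abstract, retrieved


# Usage: simulate an hour-long stream at 1 fps with random "frame features".
mem = STARMemory()
for _ in range(3600):
    mem.update(np.random.randn(256))
spatial, temporal, abstract, retrieved = mem.read()
```

The design choice this sketch mirrors is that memory size, and therefore decoder cost, is decoupled from stream length, which is what lets VRAM and latency stay flat on arbitrarily long videos.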
**VStream-QA Benchmark:** To evaluate online understanding, a new benchmark called VStream-QA is proposed. It consists of two parts: VStream-QA-Ego for egocentric videos and VStream-QA-Movie for movie clips. Each question-answer pair is marked with a specific timestamp, and the videos range from 30 to 60 minutes, significantly longer than those in existing benchmarks. The benchmark covers diverse video sources and question types, including scene summary, action description, event occurrence, ordered event narrative, and sequence validation (see the evaluation-loop sketch after the Conclusion below).

**Experimental Results:** Flash-VStream outperforms existing methods on VStream-QA, demonstrating superior real-time video stream understanding, and it also maintains state-of-the-art performance on conventional offline video understanding benchmarks. Ablation studies show that the STAR memory mechanism effectively compresses visual information and strengthens the model's ability to understand long videos.

**Conclusion:** Flash-VStream is a novel video-language model that enables real-time understanding of and question answering over long video streams. Its STAR memory mechanism significantly reduces inference latency and VRAM consumption, and the proposed VStream-QA benchmark fills a gap in existing benchmarks by providing a comprehensive evaluation of online video stream understanding.
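Because every VStream-QA question is tied to a timestamp, it supports an online protocol in which the model may only consume frames up to the moment a question is asked, with per-query latency measured at that point. The sketch below illustrates one plausible such loop; the `model.ingest`/`model.answer` interface, the field names, and the frame rate are hypothetical assumptions, not the benchmark's actual harness.

```python
# Hedged sketch of an online, timestamp-aware evaluation loop in the spirit
# of VStream-QA. The model interface and data schema are illustrative.
import time


def evaluate_stream(model, frames, qa_pairs, fps=1.0):
    """Feed frames in order; ask each question only once its timestamp passes."""
    qa = sorted(qa_pairs, key=lambda q: q["timestamp"])
    answers, qi = [], 0
    for t, frame in enumerate(frames):
        model.ingest(frame)                 # memory update runs continuously
        now = t / fps                       # seconds of video consumed so far
        while qi < len(qa) and qa[qi]["timestamp"] <= now:
            start = time.perf_counter()
            ans = model.answer(qa[qi]["question"])  # decoder reads memory only
            latency = time.perf_counter() - start
            answers.append((qa[qi]["question"], ans, latency))
            qi += 1
    return answers
```

The key property the loop enforces is causality: answers at time `now` can only depend on frames already ingested, which is exactly what distinguishes online stream understanding from offline video QA.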