30 Jun 2024 | Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jifeng Dai, Jiashi Feng, Xiaojie Jin
Flash-VStream is a video-language model designed to process long video streams in real time while answering user questions. It is built around a memory mechanism called STAR (Spatial-Temporal-Abstract-Retrieved) that compresses incoming visual information and retrieves it efficiently on demand. Compared with existing methods, this design reduces both inference latency and VRAM consumption. It also lets Flash-VStream handle extremely long video streams, which challenge traditional models because they must store and process ever-growing amounts of visual data.
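To make the idea concrete, here is a minimal sketch of a STAR-style streaming memory, using numpy arrays as stand-ins for the model's real frame embeddings. All class names, method names, and sizes are illustrative assumptions rather than the paper's actual API, and the retrieved component is omitted for brevity; the point is only that memory size stays fixed no matter how many frames arrive.

```python
# Hypothetical sketch of a bounded STAR-style memory; names and update
# rules are assumptions, not the paper's implementation.
import numpy as np

class STARMemory:
    """Keeps a fixed-size memory regardless of how many frames arrive."""

    def __init__(self, spatial_size=8, temporal_size=16, abstract_size=4, dim=256):
        self.spatial = []                      # newest frames at full detail
        self.spatial_size = spatial_size
        self.temporal = []                     # compressed features over time
        self.temporal_size = temporal_size
        self.abstract = np.zeros((abstract_size, dim))  # running semantic summary
        self.abstract_count = 0

    def update(self, frame_feature: np.ndarray) -> None:
        """Fold one incoming frame feature (shape [dim]) into the memory."""
        # Spatial memory: FIFO queue of the most recent frames.
        self.spatial.append(frame_feature)
        if len(self.spatial) > self.spatial_size:
            self.spatial.pop(0)

        # Temporal memory: when full, merge the two most similar neighbors,
        # so old content is compressed rather than simply dropped.
        self.temporal.append(frame_feature)
        if len(self.temporal) > self.temporal_size:
            sims = [float(a @ b) for a, b in zip(self.temporal, self.temporal[1:])]
            i = int(np.argmax(sims))
            merged = (self.temporal[i] + self.temporal[i + 1]) / 2
            self.temporal[i:i + 2] = [merged]

        # Abstract memory: a running average as a crude stand-in for the
        # paper's learned semantic update.
        self.abstract_count += 1
        alpha = 1.0 / self.abstract_count
        self.abstract = (1 - alpha) * self.abstract + alpha * frame_feature

    def read(self) -> np.ndarray:
        """Concatenate all memories into the context handed to the LLM."""
        parts = self.spatial + self.temporal + list(self.abstract)
        return np.stack(parts)

memory = STARMemory()
for _ in range(1000):                  # simulate a long stream
    memory.update(np.random.randn(256))
print(memory.read().shape)             # stays bounded at (28, 256)
```

Whatever the exact update rules, the bounded read-out is what keeps per-question latency and VRAM flat as the stream grows.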
To evaluate Flash-VStream, the authors propose VStream-QA, a new benchmark designed specifically for online video stream understanding. It has two parts: VStream-QA-Ego, built from first-person (egocentric) videos, and VStream-QA-Movie, built from third-person movies. The videos are notably long (30 to 60 minutes), and each question-answer pair is marked with a timestamp, ensuring that the answer depends only on visual content appearing before that point in the stream.
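As a hypothetical illustration of the timestamped format, a benchmark entry might look like the following; the field names are assumptions, not the benchmark's actual schema.

```python
# Illustrative VStream-QA-style entry; field names are hypothetical.
example_qa = {
    "video_id": "ego_0001",    # e.g. from the first-person VStream-QA-Ego split
    "timestamp": 1620.0,       # seconds; only frames before this may be used
    "question": "What did the person pick up from the kitchen counter?",
    "answer": "A red mug.",
}

def visible_frames(frame_times, qa):
    """Keep only the frames a streaming model is allowed to have seen."""
    return [t for t in frame_times if t <= qa["timestamp"]]
```

Filtering frames by timestamp like this is what makes the evaluation faithful to a live-stream setting, where future frames simply do not exist yet.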
Flash-VStream outperforms existing methods on both the new VStream-QA benchmark and conventional offline video understanding benchmarks, while achieving state-of-the-art inference latency and VRAM consumption, making it practical for real-time stream understanding. The STAR memory lets the model process long streams efficiently by compressing visual information and updating the memory in real time, and its ability to ingest a stream and answer questions simultaneously makes it a significant step forward for online video understanding.
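Below is a minimal sketch of what that simultaneous ingest-and-answer loop could look like, reusing the STARMemory sketch above. Threads stand in for whatever concurrency the real system uses, and the LLM call is elided; the question handler just reports the bounded memory it would condition on.

```python
# Hypothetical two-worker serve loop: one thread folds frames into the
# shared memory, another answers questions against its current state.
import queue
import threading
import numpy as np

memory = STARMemory()            # from the sketch above
lock = threading.Lock()
frames = queue.Queue()           # incoming frame features
questions = queue.Queue()        # incoming user questions

def frame_handler():
    while True:
        feature = frames.get()
        if feature is None:      # sentinel: stream ended
            break
        with lock:               # constant-size update, so this stays cheap
            memory.update(feature)

def question_handler():
    while True:
        q = questions.get()
        if q is None:
            break
        with lock:
            tokens = memory.read()   # bounded context keeps latency flat
        print(q, "->", f"(answered over {tokens.shape[0]} memory tokens)")

workers = [threading.Thread(target=frame_handler),
           threading.Thread(target=question_handler)]
for w in workers:
    w.start()

for _ in range(100):             # simulate a live stream plus one question
    frames.put(np.random.randn(256))
questions.put("What happened in the last minute?")

frames.put(None)
questions.put(None)
for w in workers:
    w.join()
```

Because the question handler reads a fixed-size memory rather than the raw frame history, answering cost is independent of how long the stream has been running, which is the property the latency and VRAM results above reflect.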