26 Apr 2024 | Enxin Song*, Wenhao Chai*, Tian Ye, Jenq-Neng Hwang, Life Fellow, IEEE, Xi Li, Senior Member, IEEE, Gaoang Wang, Member, IEEE
MovieChat+ is a framework for long video understanding that leverages pre-trained multimodal large language models (MLLMs) and a question-aware sparse memory mechanism. It addresses two central challenges in video understanding: modeling long-term temporal connections and containing high computational cost. Unlike previous methods that rely on complex temporal modules or additional perception tools, MovieChat+ operates zero-shot, using a memory consolidation mechanism that aligns the visual language model with question-relevant visual content. The framework combines a short-term memory module with a long-term memory module featuring question-aware consolidation, which markedly improves performance on both short and long video question answering. The memory mechanism reduces redundancy among visual tokens and compresses video features efficiently according to their relevance to the question; a sketch of this consolidation loop is given below. Two inference modes are supported: global mode for understanding the entire video and breakpoint mode for understanding a specific moment.

The work also introduces MovieChat-1K, a benchmark of 1,000 long videos with 2,000 temporal grounding labels and 14,000 manual annotations, providing a diverse set of clips for evaluating long video understanding. MovieChat+ achieves state-of-the-art results on this benchmark, outperforming previous methods in both accuracy and efficiency and demonstrating the effectiveness of the question-aware memory consolidation strategy.
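To make the question-aware memory consolidation concrete, the sketch below illustrates the kind of loop the abstract describes: frame features accumulate in a fixed-capacity short-term buffer, which is periodically flushed into long-term memory; frames are ranked by relevance to the question embedding, and the most similar adjacent features are merged to stay within a token budget. This is a minimal illustration under stated assumptions, not the authors' released implementation; all names (`SHORT_TERM_CAPACITY`, `consolidate`, the scoring functions) and the specific merge rule (averaging the most similar adjacent pair) are hypothetical.

```python
# Minimal sketch of a question-aware memory consolidation loop.
# Assumptions: each frame is a [num_tokens, dim] feature tensor from a
# frozen visual encoder; the question is a pooled [dim] embedding.
import torch
import torch.nn.functional as F

SHORT_TERM_CAPACITY = 16   # hypothetical short-term buffer length
LONG_TERM_BUDGET = 64      # hypothetical long-term memory size

def merge_most_similar(frames: list[torch.Tensor]) -> list[torch.Tensor]:
    """Merge the adjacent frame pair with the highest cosine similarity."""
    sims = [F.cosine_similarity(frames[i], frames[i + 1], dim=-1).mean()
            for i in range(len(frames) - 1)]
    i = int(torch.tensor(sims).argmax())
    merged = (frames[i] + frames[i + 1]) / 2  # average the redundant pair
    return frames[:i] + [merged] + frames[i + 2:]

def question_relevance(frame: torch.Tensor, question: torch.Tensor) -> float:
    """Score a frame against the question (assumed: mean token feature
    compared with the pooled question embedding)."""
    return F.cosine_similarity(frame.mean(0), question, dim=-1).item()

def consolidate(short_term: list, long_term: list, question: torch.Tensor):
    """Flush short-term memory into long-term memory, prioritizing frames
    relevant to the question and merging the rest to fit the budget."""
    ranked = sorted(short_term,
                    key=lambda f: question_relevance(f, question),
                    reverse=True)
    long_term.extend(ranked)
    while len(long_term) > LONG_TERM_BUDGET:
        long_term[:] = merge_most_similar(long_term)
    short_term.clear()

# Streaming inference over a long video (random tensors stand in for
# encoder outputs and the question embedding in this sketch).
short_term, long_term = [], []
question_emb = torch.randn(256)
for frame_feat in (torch.randn(32, 256) for _ in range(100)):
    short_term.append(frame_feat)
    if len(short_term) >= SHORT_TERM_CAPACITY:
        consolidate(short_term, long_term, question_emb)
```

Because long-term memory is bounded by a fixed budget, the cost of feeding memory tokens to the language model stays constant regardless of video length, which is the efficiency property the abstract attributes to the sparse memory design.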