MovieChat+: Question-aware Sparse Memory for Long Video Question Answering


26 Apr 2024 | Enxin Song*, Wenhao Chai*, Tian Ye, Jenq-Neng Hwang, Life Fellow, IEEE, Xi Li, Senior Member, IEEE, Gaoang Wang, Member, IEEE
The paper introduces MovieChat, a framework designed to understand long videos (>10K frames) using pre-trained multi-modal large language models (MLLMs) without additional trainable temporal modules. The key challenge addressed is the computational complexity and memory cost associated with long-term temporal connections in video understanding tasks. Inspired by the Atkinson-Shiffrin memory model, MovieChat employs a zero-shot approach with a memory mechanism that includes a short-term memory and a long-term memory. The long-term memory is further enhanced in MovieChat+ with a vision-question matching-based memory consolidation mechanism, which significantly improves the compactness of the memory and anchors the predictions of visual language models to relevant visual content. The authors release the MovieChat-1K benchmark, which includes 1K long videos, 2K temporal grounding labels, and 14K manual annotations, to evaluate the effectiveness of their method. Extensive quantitative evaluations and case studies demonstrate that MovieChat outperforms state-of-the-art methods in both short and long video question-answering tasks, achieving superior performance in terms of accuracy and generative quality.
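
To make the two-stage memory idea concrete, the sketch below shows one plausible way a sliding short-term buffer could be consolidated into a compact, question-aware long-term memory by keeping only the frame features most similar to the question embedding. This is a minimal illustration, not the authors' implementation: the class name, buffer sizes, feature dimension, and the top-k cosine-similarity selection rule are all assumptions made for the example.

# Minimal sketch (assumed, not the paper's code) of a question-aware
# short-term / long-term memory buffer for streaming frame features.
import torch
import torch.nn.functional as F


class QuestionAwareMemory:
    def __init__(self, short_capacity: int = 16, keep_per_consolidation: int = 4):
        self.short_capacity = short_capacity      # frames held before consolidation
        self.keep = keep_per_consolidation        # frames promoted to long-term memory
        self.short_term: list[torch.Tensor] = []  # recent frame features, each of shape (D,)
        self.long_term: list[torch.Tensor] = []   # compact, question-relevant features

    def add_frame(self, frame_feat: torch.Tensor, question_feat: torch.Tensor) -> None:
        """Push one frame feature; consolidate when the short-term buffer fills up."""
        self.short_term.append(frame_feat)
        if len(self.short_term) >= self.short_capacity:
            self._consolidate(question_feat)

    def _consolidate(self, question_feat: torch.Tensor) -> None:
        """Keep the frames whose features best match the question, drop the rest."""
        frames = torch.stack(self.short_term)                                    # (N, D)
        sims = F.cosine_similarity(frames, question_feat.unsqueeze(0), dim=-1)   # (N,)
        top = torch.topk(sims, k=min(self.keep, len(self.short_term))).indices
        self.long_term.extend(frames[i] for i in sorted(top.tolist()))           # preserve temporal order
        self.short_term.clear()

    def memory(self) -> torch.Tensor:
        """Concatenate long-term and current short-term features for the LLM prompt."""
        return torch.stack(self.long_term + self.short_term)


# Usage: stream 10K frame features against a fixed question embedding.
if __name__ == "__main__":
    mem = QuestionAwareMemory()
    question = torch.randn(768)
    for _ in range(10_000):
        mem.add_frame(torch.randn(768), question)
    print(mem.memory().shape)  # far fewer entries than the 10K input frames

The design choice illustrated here is the one the abstract emphasizes: consolidation is conditioned on the question, so the retained memory stays compact while remaining anchored to the visual content relevant to the query, rather than summarizing the whole video uniformly.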