MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding


24 Apr 2024 | Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, Ser-Nam Lim
MA-LMM is a memory-augmented large multimodal model designed for long-term video understanding. Unlike existing models that process only a limited number of frames, MA-LMM processes videos online, storing past video information in a memory bank. This allows the model to reference historical content for long-term analysis without exceeding the LLM's context length or GPU memory limits, and the memory bank integrates seamlessly into current multimodal LLMs. Extensive experiments on long-term video understanding, video question answering, and video captioning show that MA-LMM achieves state-of-the-art performance across multiple datasets.

The model's key contribution is the long-term memory bank, which captures and aggregates historical video information in an auto-regressive manner, enabling efficient long-term video modeling. The memory bank is compatible with the Q-Former and can be plugged into existing models.

To further improve efficiency, a memory bank compression method is proposed that keeps the memory bank at a fixed length regardless of the input video length. By selecting the most similar adjacent frame features and averaging them, it preserves temporal information while reducing redundancy.

By processing video sequences online, MA-LMM significantly reduces GPU memory usage and sidesteps LLM context length limitations. It achieves new state-of-the-art results on long-term video understanding, video question answering, and video captioning, offering efficient and effective long-term video modeling with clear advantages over prior approaches.
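The compression step described above can be sketched roughly as follows. This is a minimal PyTorch illustration, not the authors' implementation: it assumes per-frame features of shape (tokens, channels) and a hypothetical memory budget `max_len`; the function names `update_memory_bank` and `compress_memory_bank` are illustrative only. The idea is that whenever the bank exceeds its budget, the pair of temporally adjacent entries with the highest similarity is averaged into one, so redundant neighboring frames are merged while distinct ones are kept.

```python
import torch
import torch.nn.functional as F


def compress_memory_bank(memory: torch.Tensor, max_len: int) -> torch.Tensor:
    """Shrink the memory bank to at most `max_len` entries.

    memory: tensor of shape (T, N, C) -- T timesteps, N tokens, C channels.
    While T exceeds max_len, the most similar pair of temporally adjacent
    entries is averaged into a single entry (reducing T by one each step).
    """
    while memory.shape[0] > max_len:
        # Cosine similarity between each entry and its temporal successor,
        # averaged over the token dimension -> shape (T - 1,)
        sim = F.cosine_similarity(memory[:-1], memory[1:], dim=-1).mean(dim=-1)
        k = int(sim.argmax())                        # most redundant adjacent pair
        merged = 0.5 * (memory[k] + memory[k + 1])   # average the two entries
        memory = torch.cat([memory[:k], merged.unsqueeze(0), memory[k + 2:]], dim=0)
    return memory


def update_memory_bank(memory: torch.Tensor, frame_feat: torch.Tensor,
                       max_len: int = 20) -> torch.Tensor:
    """Online update: append the current frame's features, then compress."""
    memory = torch.cat([memory, frame_feat.unsqueeze(0)], dim=0)
    return compress_memory_bank(memory, max_len)


# Example: stream 100 frames of (32 tokens x 768 channels) features into a
# bank capped at 20 entries.
bank = torch.empty(0, 32, 768)
for _ in range(100):
    bank = update_memory_bank(bank, torch.randn(32, 768), max_len=20)
print(bank.shape)  # torch.Size([20, 32, 768])
```

Because the bank is bounded, the per-frame cost of attending to history stays roughly constant no matter how long the video is, which is what keeps the approach within LLM context and GPU memory limits.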