12 Mar 2024 | Kunchang Li, Xinhao Li, Yi Wang, Yinan He, Yali Wang, Limin Wang, Yu Qiao
**VideoMamba: State Space Model for Efficient Video Understanding**
This paper introduces VideoMamba, a state space model (SSM) designed for efficient video understanding. VideoMamba addresses the challenges of local redundancy and global dependencies in video data, leveraging the linear-complexity operator of SSMs to enable efficient long-term modeling. Key contributions and findings include:
1. **Scalability in the Visual Domain**: VideoMamba demonstrates remarkable scalability without extensive dataset pretraining, thanks to a novel self-distillation technique.
2. **Sensitivity for Short-term Action Recognition**: It excels in recognizing short-term actions, even with fine-grained motion differences, outperforming existing attention-based models.
3. **Superiority in Long-term Video Understanding**: VideoMamba shows significant advancements over traditional feature-based models, achieving faster processing speeds and lower GPU memory requirements.
4. **Compatibility with Other Modalities**: It performs robustly in multi-modal contexts, particularly in video-text retrieval.
The paper also discusses related work, including state space models and video understanding techniques, and provides a detailed methodological overview covering the architecture and experimental results. VideoMamba is shown to outperform state-of-the-art models on various benchmarks, making it a promising solution for comprehensive video understanding. All code and models are open-sourced.
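To make the linear-complexity point concrete, below is a minimal sketch of a diagonal state space recurrence in NumPy. The parameter names, shapes, and initialization here are illustrative assumptions, not VideoMamba's actual bidirectional Mamba implementation (the paper's open-sourced code uses selective, input-dependent parameters and hardware-aware scans); the sketch only shows why the per-step state update makes cost linear in sequence length, unlike the quadratic pairwise interactions of self-attention.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Minimal diagonal SSM recurrence (illustrative sketch, not the paper's code).

    x: (T, D)  input token sequence (e.g., flattened video patch tokens)
    A: (D, N)  per-channel state decay (diagonal state matrix)
    B: (D, N)  input projection into the hidden state
    C: (D, N)  readout from the hidden state

    Each step only updates a (D, N) hidden state, so total cost is
    O(T * D * N): linear in sequence length T.
    """
    T, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))
    y = np.empty((T, D))
    for t in range(T):
        h = A * h + B * x[t][:, None]   # recurrent state update
        y[t] = (C * h).sum(-1)          # per-channel readout
    return y

# Toy usage: 1024 "patch tokens" of width 64 with a 16-dim state (made-up sizes).
T, D, N = 1024, 64, 16
rng = np.random.default_rng(0)
x = rng.standard_normal((T, D))
A = np.exp(-rng.uniform(0.0, 1.0, (D, N)))  # stable decay in (0, 1]
B = rng.standard_normal((D, N)) * 0.1
C = rng.standard_normal((D, N)) * 0.1
print(ssm_scan(x, A, B, C).shape)  # (1024, 64)
```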