12 Mar 2024 | Kunchang Li, Xinhao Li, Yi Wang, Yinan He, Yali Wang, Limin Wang, Yu Qiao
**VideoMamba: State Space Model for Efficient Video Understanding**
This paper introduces VideoMamba, a state space model (SSM) designed for efficient video understanding. VideoMamba addresses the challenges of local redundancy and global dependencies in video data, leveraging the linear-complexity operator of SSMs to enable efficient long-term modeling. Key contributions and findings include:
1. **Scalability in the Visual Domain**: VideoMamba demonstrates remarkable scalability without extensive dataset pretraining, thanks to a novel self-distillation technique.
2. **Sensitivity for Short-term Action Recognition**: It excels in recognizing short-term actions, even with fine-grained motion differences, outperforming existing attention-based models.
3. **Superiority in Long-term Video Understanding**: VideoMamba shows significant advancements over traditional feature-based models, achieving faster processing speeds and lower GPU memory requirements.
4. **Compatibility with Other Modalities**: It performs robustly in multi-modal contexts, particularly in video-text retrieval.
The paper also discusses related work, including state space models and video understanding techniques, and provides a detailed methodological overview covering the architecture and experimental results. VideoMamba is shown to outperform state-of-the-art models on various benchmarks, making it a promising solution for comprehensive video understanding. All code and models are open-sourced.
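To make the linear-complexity point concrete, below is a minimal sketch of a diagonal state space recurrence in NumPy. The parameter names, shapes, and initialization here are illustrative assumptions, not VideoMamba's actual bidirectional Mamba implementation (the paper's open-sourced code uses selective, input-dependent parameters and hardware-aware scans); the sketch only shows why the per-step state update makes cost linear in sequence length, unlike the quadratic pairwise interactions of self-attention.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Minimal diagonal SSM recurrence (illustrative sketch, not the paper's code).

    x: (T, D)  input token sequence (e.g., flattened video patch tokens)
    A: (D, N)  per-channel state decay (diagonal state matrix)
    B: (D, N)  input projection into the hidden state
    C: (D, N)  readout from the hidden state

    Each step only updates a (D, N) hidden state, so total cost is
    O(T * D * N): linear in sequence length T.
    """
    T, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))
    y = np.empty((T, D))
    for t in range(T):
        h = A * h + B * x[t][:, None]   # recurrent state update
        y[t] = (C * h).sum(-1)          # per-channel readout
    return y

# Toy usage: 1024 "patch tokens" of width 64 with a 16-dim state (made-up sizes).
T, D, N = 1024, 64, 16
rng = np.random.default_rng(0)
x = rng.standard_normal((T, D))
A = np.exp(-rng.uniform(0.0, 1.0, (D, N)))  # stable decay in (0, 1]
B = rng.standard_normal((D, N)) * 0.1
C = rng.standard_normal((D, N)) * 0.1
print(ssm_scan(x, A, B, C).shape)  # (1024, 64)
```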