Multi-Scale VMamba: Hierarchy in Hierarchy Visual State Space Model


23 May 2024 | Yuheng Shi, Minjing Dong, Chang Xu
Multi-Scale VMamba (MSVMamba) is a vision backbone based on State Space Models (SSMs) that addresses the long-range forgetting problem in parameter-constrained vision models. The method introduces a multi-scale 2D scanning strategy that reduces computational redundancy while preserving the global receptive field and linear complexity of SSMs, and pairs it with a hierarchical design that adds a Convolutional Feed-Forward Network (ConvFFN) to enhance channel mixing and local feature capture.

MSVMamba performs well on image classification, object detection, and semantic segmentation: the MSVMamba-Tiny model reaches 82.8% top-1 accuracy on ImageNet, 46.9% box mAP and 42.2% instance mAP with the Mask R-CNN framework, and 47.6% mIoU on ADE20K, at lower computational cost than existing SSM-based approaches. The design also scales: the Nano, Micro, and Tiny variants trade parameter count and compute for accuracy. Overall, the approach improves a vision model's ability to capture long-range dependencies while maintaining efficiency.
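The summary names two building blocks: a multi-scale 2D scan that shortens the sequences fed to the SSM, and a ConvFFN for channel mixing. Below is a minimal PyTorch sketch of both ideas, not the authors' released implementation; the `ConvFFN` layout (1x1 conv, depthwise 3x3 conv, GELU, 1x1 conv), the 2x average-pool downsampling, and the four-route scan layout are assumptions made for illustration.

```python
# Minimal sketch only: names, the expansion ratio, the pooling choice, and
# the scan-route layout are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvFFN(nn.Module):
    """Channel-mixing FFN with a depthwise 3x3 conv for local features."""

    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Conv2d(dim, hidden, kernel_size=1)    # expand channels
        self.dw = nn.Conv2d(hidden, hidden, kernel_size=3,
                            padding=1, groups=hidden)       # depthwise local mixing
        self.act = nn.GELU()
        self.fc2 = nn.Conv2d(hidden, dim, kernel_size=1)    # project back

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (B, C, H, W)
        return self.fc2(self.act(self.dw(self.fc1(x))))


def multi_scale_scan(x: torch.Tensor) -> list[torch.Tensor]:
    """Flatten one scan route at full resolution and the remaining routes
    on a 2x-downsampled map, so most routes feed the SSM a sequence that
    is 4x shorter -- the source of the reduced redundancy. Assumes even H, W.
    """
    _, _, H, W = x.shape
    full = x.flatten(2)                      # route 1: length H * W
    small = F.avg_pool2d(x, kernel_size=2)   # (B, C, H/2, W/2)
    routes = [
        small.flatten(2),                    # row-major
        small.transpose(2, 3).flatten(2),    # column-major
        small.flatten(2).flip(-1),           # reversed row-major
    ]                                        # each of length (H * W) / 4
    return [full] + routes
```

In a complete block, each flattened sequence would be processed by a selective-scan (S6) kernel and the results merged back into the 2D feature map; that machinery is omitted here.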