Multi-Scale VMamba: Hierarchy in Hierarchy Visual State Space Model

23 May 2024 | Yuheng Shi, Minjing Dong, Chang Xu
The paper introduces Multi-Scale Vision Mamba (MSVMamba), a State Space Model (SSM)-based vision backbone that combines linear complexity with a global receptive field. To address the long-range forgetting problem in parameter-limited vision models, MSVMamba employs a Multi-Scale 2D (MS2D) scanning technique, which reduces computational redundancy and improves the model's ability to capture long-range dependencies. The model also integrates a Convolutional Feed-Forward Network (ConvFFN) to improve channel-wise information exchange and local feature capture. Experiments on ImageNet, COCO, and ADE20K show that MSVMamba outperforms a range of models, including CNNs, ViTs, and other SSM-based models, with improved efficiency and accuracy. The proposed MS2D scanning technique and ConvFFN integration significantly enhance the model's performance, making it a robust and scalable option for high-accuracy, resource-efficient model design in computer vision tasks.
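
The summary does not spell out how multi-scale scanning reduces redundancy, so the following is only a rough sketch under stated assumptions. It assumes that one of four scan routes runs over the full-resolution feature map while the other three run over an average-pooled copy with stride 2. The function names (`scan_orders`, `avg_pool`, `multi_scale_sequences`) and the choice of pooling are hypothetical and are not taken from the authors' code. The sketch shows only how the sequences are built and how much shorter they become; it does not implement the SSM itself.

```python
def scan_orders(height, width):
    """Return four 1D traversal orders over an H x W grid of flat indices:
    row-major, reversed row-major, column-major, reversed column-major."""
    row_major = [r * width + c for r in range(height) for c in range(width)]
    col_major = [r * width + c for c in range(width) for r in range(height)]
    return [row_major, row_major[::-1], col_major, col_major[::-1]]


def avg_pool(grid, stride):
    """Downsample a 2D list of floats with non-overlapping average pooling."""
    h, w = len(grid), len(grid[0])
    pooled = []
    for r in range(0, h - stride + 1, stride):
        row = []
        for c in range(0, w - stride + 1, stride):
            block = [grid[r + i][c + j]
                     for i in range(stride) for j in range(stride)]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled


def multi_scale_sequences(grid, stride=2):
    """Assumed multi-scale layout: route 0 scans the full-resolution map,
    routes 1-3 scan the pooled map in the remaining three directions."""
    h, w = len(grid), len(grid[0])
    flat_full = [v for row in grid for v in row]

    small = avg_pool(grid, stride)
    flat_small = [v for row in small for v in row]

    full_orders = scan_orders(h, w)
    small_orders = scan_orders(len(small), len(small[0]))

    sequences = [[flat_full[i] for i in full_orders[0]]]
    sequences += [[flat_small[i] for i in order] for order in small_orders[1:]]
    return sequences


if __name__ == "__main__":
    # Toy 4 x 4 single-channel feature map with values 0..15.
    grid = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
    seqs = multi_scale_sequences(grid)
    total = sum(len(s) for s in seqs)
    baseline = 4 * 16  # four full-resolution scans
    print(f"tokens scanned: {total} (multi-scale) vs {baseline} (all full-res)")
```

Under these assumptions, a 4 x 4 map yields one sequence of 16 tokens and three of 4 tokens each, so 28 tokens are scanned instead of 64. The same ratio (7/16 for stride 2) holds for any map whose sides are divisible by the stride.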