15 Mar 2024 | Xiaohuan Pei*, Tao Huang*, and Chang Xu
EfficientVMamba is a lightweight visual state space model that addresses the trade-off between accuracy and computational efficiency in vision tasks. Inspired by state space models (SSMs) like Mamba, which achieve linear time complexity for global information extraction, EfficientVMamba integrates an atrous-based selective scan approach with efficient skip sampling to reduce computational complexity while preserving global receptive fields. This approach enables the model to efficiently extract both global and local features through a combination of SSM blocks and convolutional branches, with a Squeeze-and-Excitation module to balance feature integration. The model also introduces an inverted residual insertion strategy, placing global representation modules in early stages and local feature extraction in later stages, which enhances efficiency and performance.
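The skip-sampling idea behind the atrous selective scan can be illustrated with a small NumPy sketch. This is a structural illustration only, not the paper's implementation: it partitions a feature map into interleaved sub-maps so that each scan sequence covers the full spatial extent with a fraction of the tokens, then merges them back. The function names (`es2d_partition`, `es2d_merge`) and the stride parameter `p` are illustrative choices, not names from the paper.

```python
import numpy as np

def es2d_partition(x, p=2):
    """Skip-sample an (H, W, C) feature map into p*p interleaved sub-maps.

    Each sub-map holds H*W / p^2 tokens, so a selective scan over one
    sub-map processes a quarter of the tokens (for p=2) while its samples
    still span the whole image, preserving a global receptive field.
    """
    return [x[i::p, j::p, :] for i in range(p) for j in range(p)]

def es2d_merge(subs, p=2):
    """Inverse of es2d_partition: scatter the sub-maps back to full resolution."""
    h, w, c = subs[0].shape
    out = np.zeros((h * p, w * p, c), dtype=subs[0].dtype)
    for idx, s in enumerate(subs):
        i, j = divmod(idx, p)
        out[i::p, j::p, :] = s
    return out

x = np.random.default_rng(0).standard_normal((8, 8, 16))
subs = es2d_partition(x)
assert all(s.shape == (4, 4, 16) for s in subs)  # each scan sees 1/4 of the tokens
assert np.allclose(es2d_merge(subs), x)          # partition/merge is lossless
```

In the actual model, a selective-scan (SSM) pass would run over each flattened sub-map between the partition and merge steps; here those steps are omitted to keep the sampling pattern itself visible.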
EfficientVMamba achieves significant improvements in various vision tasks, including image classification, object detection, and semantic segmentation. For example, EfficientVMamba-S with 1.3G FLOPs outperforms Vim-Ti with 1.5G FLOPs by 5.6% top-1 accuracy in ImageNet classification. The model's architecture includes an efficient 2D scanning method (ES2D) that reduces the number of tokens processed during scanning, improving feature extraction efficiency. The model also incorporates a dual-pathway module that combines global and local feature extraction, with a channel attention module to balance their integration.
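The dual-pathway module with channel attention can be sketched as follows. This is a minimal NumPy illustration of the structure, assuming identity placeholders for the two branches: a real block would use an ES2D selective scan as the global branch and a depthwise convolution as the local branch. The helper names (`se_gate`, `dual_pathway`) and the weight shapes are assumptions for the sketch, not the paper's API.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_gate(x, w1, w2):
    """Squeeze-and-Excitation: global average pool -> FC bottleneck -> sigmoid gate."""
    s = x.mean(axis=(0, 1))                    # squeeze over spatial dims: (C,)
    e = sigmoid(w2 @ np.maximum(w1 @ s, 0.0))  # excitation with ReLU bottleneck: (C,)
    return x * e                               # channel-wise reweighting

def dual_pathway(x, global_branch, local_branch, params):
    """Sum an SE-gated global (SSM-style) path and an SE-gated local (conv) path."""
    g = se_gate(global_branch(x), *params["global"])
    l = se_gate(local_branch(x), *params["local"])
    return g + l

rng = np.random.default_rng(0)
C, r = 16, 4  # channels and SE reduction ratio (illustrative values)
params = {
    "global": (rng.standard_normal((r, C)), rng.standard_normal((C, r))),
    "local":  (rng.standard_normal((r, C)), rng.standard_normal((C, r))),
}
x = rng.standard_normal((8, 8, C))
# Identity maps stand in for the SSM scan and the convolutional branch.
y = dual_pathway(x, lambda t: t, lambda t: t, params)
assert y.shape == x.shape
```

The SE gates let the network learn, per channel, how much to trust each pathway before the two feature streams are fused by addition.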
Experiments show that EfficientVMamba achieves competitive performance with significantly reduced computational complexity compared to existing lightweight models. The model variants, EfficientVMamba-T, EfficientVMamba-S, and EfficientVMamba-B, trade off efficiency against accuracy, with EfficientVMamba-B achieving a top-1 accuracy of 81.8% on ImageNet at 4.0G FLOPs. In object detection, EfficientVMamba-T achieves 37.5% AP, surpassing ResNet-18. In semantic segmentation, EfficientVMamba-T achieves mIoU scores of 38.9% and 39.3%, outperforming ResNet-50 with fewer parameters.
The model's design emphasizes efficiency through selective scanning, SSM-conv fusion, and inverted residual insertion, making it suitable for resource-constrained environments. EfficientVMamba demonstrates the potential of state space models in lightweight vision tasks, offering a balance between accuracy and computational efficiency.