14 Mar 2024 | Tao Huang, Xiaohuan Pei, Shan You, Fei Wang, Chen Qian, and Chang Xu
LocalMamba is an innovative approach to visual state space models (SSMs) that enhances the capture of local dependencies within images while maintaining global contextual understanding. The key contributions of this paper include:
1. **Local Scan Mechanism**: LocalMamba introduces a local scanning strategy that divides an image into distinct windows and scans the tokens within each window, capturing local dependencies while maintaining a global perspective. This addresses a weakness of earlier methods, which flatten spatial tokens into a single sequence and thereby disrupt natural 2D dependencies and weaken the model's ability to interpret spatial relationships.
2. **Dynamic Scan Direction Search**: To improve performance, LocalMamba proposes a dynamic method that searches for the optimal scan choices independently for each layer. Because different network layers prefer different scan patterns, this lets the model identify and apply the most effective combination of scans.
3. **Model Variants**: Two model variants, LocalVim and LocalVMamba, are designed with plain and hierarchical structures, respectively. Extensive experiments on image classification, object detection, and semantic segmentation demonstrate significant improvements over prior methods. For example, LocalVim-T outperforms Vim-Ti by 3.1% on ImageNet at the same 1.5G FLOPs.
4. **Ablation Study**: The effectiveness of the local scan technique and the spatial and channel attention module (SCAttn) is evaluated through ablation studies. Results show that the local scan and SCAttn significantly enhance the model's performance.
5. **Scalability and Future Work**: The paper discusses the scalability of the approach to more complex and diverse visual tasks and the potential integration of advanced scanning strategies.
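The local scan described in item 1 can be illustrated with a small token-reordering sketch. This is a minimal illustration, not the authors' implementation: the function names (`local_scan_order`, `local_scan`, `undo_scan`), the row-major order inside each window, and the requirement that the image size be divisible by the window size are all assumptions made here for clarity.

```python
import numpy as np

def local_scan_order(height, width, window):
    """Return flat token indices visited window by window (hypothetical helper).

    Assumes row-major order inside each window and across windows.
    """
    if height % window or width % window:
        raise ValueError("height and width must be divisible by window")
    idx = np.arange(height * width).reshape(height, width)
    # Axes: (window-grid row, row inside window, window-grid col, col inside window)
    blocks = idx.reshape(height // window, window, width // window, window)
    # Move both window-grid axes to the front so each window is contiguous.
    return blocks.transpose(0, 2, 1, 3).reshape(-1)

def local_scan(tokens, height, width, window):
    """Reorder a (height * width, channels) token array into local-scan order."""
    order = local_scan_order(height, width, window)
    return tokens[order], order

def undo_scan(scanned, order):
    """Restore the original row-major token order after sequence processing."""
    restored = np.empty_like(scanned)
    restored[order] = scanned
    return restored

if __name__ == "__main__":
    # A 4x4 grid with 2x2 windows: tokens of one window become neighbours.
    print(local_scan_order(4, 4, 2).tolist())
    # → [0, 1, 4, 5, 2, 3, 6, 7, 8, 9, 12, 13, 10, 11, 14, 15]
```

With plain row-major flattening, tokens 0 and 4 (vertical neighbours) sit four positions apart in the sequence; in the windowed order they are two positions apart, which is the locality benefit the summary refers to. A sequence model would process the reordered tokens, and `undo_scan` would map its outputs back to the 2D layout.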
LocalMamba's superior performance and efficiency in handling long sequences make it a promising framework for vision tasks, opening new avenues for research in efficient and effective state space modeling.