26 May 2024 | Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, Yunfan Liu
VMamba is a vision backbone network that integrates State Space Models (SSMs) to achieve linear time complexity. It introduces the 2D Selective Scan (SS2D) module to bridge the gap between 1D scanning and 2D vision data processing, enabling efficient visual representation learning. VMamba outperforms existing models in image classification, object detection, and semantic segmentation tasks, demonstrating superior input scaling efficiency. The architecture of VMamba is built using Visual State-Space (VSS) blocks, which replace the S6 module with SS2D to capture contextual information effectively. Through architectural and implementation enhancements, VMamba achieves high throughput and efficiency, making it suitable for downstream tasks with large-resolution inputs. The paper also discusses the relationship between SS2D and self-attention, showing that SS2D can capture and retain traversed information, leading to effective global receptive fields. VMamba's linear time complexity and efficient computation make it a promising approach for visual tasks.VMamba is a vision backbone network that integrates State Space Models (SSMs) to achieve linear time complexity. It introduces the 2D Selective Scan (SS2D) module to bridge the gap between 1D scanning and 2D vision data processing, enabling efficient visual representation learning. VMamba outperforms existing models in image classification, object detection, and semantic segmentation tasks, demonstrating superior input scaling efficiency. The architecture of VMamba is built using Visual State-Space (VSS) blocks, which replace the S6 module with SS2D to capture contextual information effectively. Through architectural and implementation enhancements, VMamba achieves high throughput and efficiency, making it suitable for downstream tasks with large-resolution inputs. The paper also discusses the relationship between SS2D and self-attention, showing that SS2D can capture and retain traversed information, leading to effective global receptive fields. VMamba's linear time complexity and efficient computation make it a promising approach for visual tasks.