VMamba: Visual State Space Model


26 May 2024 | Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, Yunfan Liu
VMamba is an efficient vision backbone network designed to process visual data with linear time complexity. It brings the benefits of State Space Models (SSMs) from natural language processing to vision through the 2D Selective Scan (SS2D) module, which bridges the gap between ordered 1D scanning and the non-sequential structure of 2D visual data. The core of VMamba is the Visual State-Space (VSS) block, adapted from the Mamba block used in NLP. By scanning feature maps along multiple traversal paths, SS2D gathers contextual information from different directions and perspectives, giving the model a global receptive field with input-dependent (dynamic) weights.

VMamba is developed in three scales: VMamba-Tiny, VMamba-Small, and VMamba-Base. The input image is partitioned into patches, and hierarchical representations are built over multiple stages, each consisting of a down-sampling layer followed by a stack of VSS blocks. The VSS block is designed to be computationally efficient: it uses a single network branch with two residual modules, similar to a vanilla Transformer block.
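The cross-scan idea behind SS2D (unfold the 2D feature map into several 1D sequences, run a selective scan over each, then fold the results back and merge them) can be illustrated with a short sketch. The code below is a minimal, illustrative version assuming a PyTorch-style (B, C, H, W) layout; `selective_scan_1d` is a stand-in stub for Mamba's S6 kernel, and the function names are chosen for this sketch rather than taken from the authors' implementation.

```python
# Minimal sketch of the SS2D cross-scan / cross-merge idea.
# Assumes a (B, C, H, W) feature map; the selective scan is a placeholder.
import torch


def cross_scan(x: torch.Tensor) -> torch.Tensor:
    """Unfold a 2D feature map into four 1D token sequences.

    x: (B, C, H, W) -> (B, 4, C, H*W)
    Path 0: row-major (left-to-right, top-to-bottom)
    Path 1: column-major (top-to-bottom, left-to-right)
    Paths 2, 3: the reverses of paths 0 and 1.
    """
    row_major = x.flatten(2)                              # (B, C, H*W)
    col_major = x.transpose(2, 3).flatten(2)              # (B, C, W*H)
    seqs = torch.stack([row_major, col_major], dim=1)     # (B, 2, C, L)
    return torch.cat([seqs, seqs.flip(-1)], dim=1)        # (B, 4, C, L)


def cross_merge(y: torch.Tensor, H: int, W: int) -> torch.Tensor:
    """Fold the four scanned sequences back to (B, C, H, W) and sum them."""
    B, _, C, _ = y.shape
    # Undo the reversal of paths 2 and 3, then fold each path back to 2D.
    y = torch.cat([y[:, :2], y[:, 2:].flip(-1)], dim=1)
    out = y[:, 0].reshape(B, C, H, W)                              # row-major
    out = out + y[:, 1].reshape(B, C, W, H).transpose(2, 3)        # column-major
    out = out + y[:, 2].reshape(B, C, H, W)
    out = out + y[:, 3].reshape(B, C, W, H).transpose(2, 3)
    return out


def selective_scan_1d(seq: torch.Tensor) -> torch.Tensor:
    """Placeholder for the input-dependent (selective) state-space scan.

    A real implementation recurs over the sequence with data-dependent
    A/B/C/Delta parameters; a cumulative sum stands in for it here so the
    sketch runs end to end.
    """
    return seq.cumsum(dim=-1)


def ss2d(x: torch.Tensor) -> torch.Tensor:
    """Cross-scan -> per-path 1D scan -> cross-merge."""
    _, _, H, W = x.shape
    seqs = cross_scan(x)                 # (B, 4, C, H*W)
    scanned = selective_scan_1d(seqs)    # one 1D scan per traversal path
    return cross_merge(scanned, H, W)    # (B, C, H, W)


if __name__ == "__main__":
    feat = torch.randn(2, 96, 14, 14)
    print(ss2d(feat).shape)  # torch.Size([2, 96, 14, 14])
```

Because each of the four paths is processed by a scan that is linear in the number of tokens, the overall cost of SS2D stays linear in H·W, which is what allows the backbone to keep linear time complexity.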
The effectiveness of VMamba is demonstrated through extensive experiments on image classification, object detection, and semantic segmentation. VMamba outperforms benchmark models such as Swin Transformer and ConvNeXt in both classification accuracy and computational efficiency, reaching a top-1 accuracy of 83.9% on ImageNet-1K with VMamba-Base, and its FLOPs grow linearly with input resolution rather than quadratically as in ViT-based models. VMamba also adapts well to different input resolutions, maintaining high accuracy and throughput. The paper further discusses the relationship between SS2D and self-attention, visualizing attention and activation maps to illustrate the effectiveness of the proposed scanning approach, and analyzes the effective receptive field (ERF) to show that VMamba achieves a global ERF, unlike models whose ERFs remain local. The study concludes by highlighting VMamba's advantages in efficient long-sequence modeling and its potential for future research on architectural design and pre-training techniques.
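To make the scaling claim concrete, the rough back-of-the-envelope sketch below compares how the dominant sequence-mixing cost grows with input resolution, assuming 16x16 patches, an illustrative embedding dimension, and an illustrative SSM state size (these constants are not taken from the paper): global self-attention contributes a term on the order of N^2 * d, while a selective scan contributes a term on the order of N * d * d_state.

```python
# Illustrative scaling of the sequence-mixing cost with input resolution.
# Only the N vs. N**2 growth is the point; the constants are rough.
def tokens(resolution: int, patch: int = 16) -> int:
    """Number of tokens for a square input at the given resolution."""
    return (resolution // patch) ** 2

d, d_state = 96, 16  # illustrative embedding dim and SSM state size
for res in (224, 448, 896):
    n = tokens(res)
    attn_cost = n * n * d          # grows quadratically with token count
    scan_cost = n * d * d_state    # grows linearly with token count
    print(f"{res}px: N={n:5d}  attention~{attn_cost:.2e}  scan~{scan_cost:.2e}")
```

Doubling the resolution quadruples the token count, so the attention term grows by roughly 16x while the scan term grows by roughly 4x; only these growth rates, not the absolute numbers, are meaningful here.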