Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

2024-02-10 | Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, Xinggang Wang
Vision Mamba (Vim) is a vision backbone that leverages bidirectional state space models (SSMs) for efficient visual representation learning. Unlike vision transformers (ViTs), which rely on self-attention, Vim models visual data with a Mamba-inspired SSM, achieving strong performance and efficiency. Vim processes an image by splitting it into patches, linearly projecting the patches into a token sequence, adding position embeddings, and applying bidirectional SSMs to capture global visual context.

This approach enables Vim to outperform ViTs on tasks such as image classification, semantic segmentation, and object detection while significantly reducing computational and memory costs. For example, Vim is 2.8× faster than DeiT and saves 86.8% GPU memory when processing images of size 1248×1248. Vim's efficiency is further enhanced by its hardware-aware design, which optimizes memory and computation for high-resolution images.

The model is trained on ImageNet and can be used as a backbone for various downstream tasks, including dense prediction. Vim's bidirectional SSMs allow it to handle long sequences and provide robust visual representations, making it a promising candidate for next-generation vision foundation models. The results demonstrate that Vim can effectively address the computational and memory constraints of traditional vision models, offering a more efficient and scalable solution for visual representation learning.
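To make the described pipeline concrete, the sketch below shows the basic flow in PyTorch: patchify an image with a linear projection, add position embeddings, and run the patch sequence through stacked bidirectional SSM blocks. This is a minimal illustration under stated assumptions, not the authors' implementation: the `SimpleSSM` recurrence is a plain diagonal linear scan standing in for Mamba's hardware-aware selective scan, the class token used in the paper is replaced by mean pooling, and all module and parameter names (`VimSketch`, `BidirectionalSSMBlock`, `SimpleSSM`) are hypothetical.

```python
import torch
import torch.nn as nn


class SimpleSSM(nn.Module):
    """Per-channel diagonal linear SSM scan: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.

    A deliberately simple stand-in for Mamba's selective-scan kernel.
    """

    def __init__(self, dim):
        super().__init__()
        self.a_logit = nn.Parameter(torch.zeros(dim))  # a = sigmoid(a_logit) in (0, 1) for stability
        self.b = nn.Parameter(torch.ones(dim))
        self.c = nn.Parameter(torch.ones(dim))

    def forward(self, x):                              # x: (batch, seq_len, dim)
        a = torch.sigmoid(self.a_logit)
        h = torch.zeros(x.size(0), x.size(2), device=x.device, dtype=x.dtype)
        outputs = []
        for t in range(x.size(1)):                     # sequential scan over patch tokens
            h = a * h + self.b * x[:, t]
            outputs.append(self.c * h)
        return torch.stack(outputs, dim=1)


class BidirectionalSSMBlock(nn.Module):
    """Scans the patch sequence forward and backward, then merges the two directions."""

    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.forward_ssm = SimpleSSM(dim)
        self.backward_ssm = SimpleSSM(dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        residual = x
        x = self.norm(x)
        fwd = self.forward_ssm(x)
        bwd = self.backward_ssm(x.flip(1)).flip(1)     # reverse sequence, scan, restore order
        return residual + self.proj(fwd + bwd)


class VimSketch(nn.Module):
    """Patchify -> linear projection -> position embeddings -> stacked bidirectional SSM blocks."""

    def __init__(self, image_size=224, patch_size=16, dim=192, depth=4, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Conv with stride = patch size is equivalent to a linear projection of flattened patches.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.blocks = nn.ModuleList([BidirectionalSSMBlock(dim) for _ in range(depth)])
        self.head = nn.Linear(dim, num_classes)

    def forward(self, images):                         # images: (batch, 3, H, W)
        x = self.patch_embed(images)                   # (batch, dim, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)               # (batch, num_patches, dim)
        x = x + self.pos_embed
        for block in self.blocks:
            x = block(x)
        return self.head(x.mean(dim=1))                # mean-pool patches (paper uses a class token)


if __name__ == "__main__":
    model = VimSketch()
    logits = model(torch.randn(2, 3, 224, 224))
    print(logits.shape)                                # torch.Size([2, 1000])
```

The naive Python loop in `SimpleSSM` runs in O(sequence length) per block; the efficiency figures quoted above come from Mamba's fused, hardware-aware scan kernel rather than from this didactic recurrence.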