MambaVision: A Hybrid Mamba-Transformer Vision Backbone

10 Jul 2024 | Ali Hatamizadeh, Jan Kautz
The paper introduces MambaVision, a novel hybrid Mamba-Transformer backbone designed specifically for vision applications. The core contributions are a redesigned Mamba formulation that improves efficient modeling of visual features and a comprehensive ablation study on integrating Vision Transformers (ViT) with Mamba. The results show that adding several self-attention blocks at the final layers significantly improves the model's ability to capture long-range spatial dependencies. MambaVision uses a hierarchical architecture: CNN-based residual blocks perform fast feature extraction at higher resolutions, while MambaVision mixer and Transformer blocks operate at lower resolutions. This design achieves new state-of-the-art (SOTA) Top-1 accuracy and image throughput on ImageNet-1K. In downstream tasks such as object detection, instance segmentation, and semantic segmentation on the MS COCO and ADE20K datasets, MambaVision outperforms comparably sized backbones. The paper also includes a detailed methodology, experimental results, and an ablation study validating the effectiveness of the proposed design choices.
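
To make the hybrid stage layout concrete, below is a minimal PyTorch sketch of the idea described above: convolutional residual blocks in the high-resolution stages, and, in the low-resolution stages, a first half of Mamba-style mixer blocks followed by a second half of self-attention blocks. This is an illustrative sketch only, not the authors' implementation; the mixer internals are simplified to a gated depthwise 1D convolution, and all class and argument names (ConvBlock, MixerBlock, AttnBlock, make_stage) are assumptions.

```python
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    """CNN residual block, as used in the high-resolution stages (sketch)."""
    def __init__(self, dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.BatchNorm2d(dim), nn.GELU(),
            nn.Conv2d(dim, dim, 3, padding=1), nn.BatchNorm2d(dim),
        )

    def forward(self, x):                      # x: (B, C, H, W)
        return x + self.body(x)


class MixerBlock(nn.Module):
    """Stand-in for a MambaVision mixer block: token mixing via a gated
    depthwise 1D convolution over the flattened sequence (placeholder,
    not the actual selective-scan formulation)."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.conv = nn.Conv1d(dim, dim, 3, padding=1, groups=dim)
        self.gate = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (B, N, C)
        y = self.norm(x)
        mixed = self.conv(y.transpose(1, 2)).transpose(1, 2)
        return x + self.proj(mixed * torch.sigmoid(self.gate(y)))


class AttnBlock(nn.Module):
    """Standard self-attention block, placed in the final blocks of a stage."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                      # x: (B, N, C)
        y = self.norm(x)
        return x + self.attn(y, y, y, need_weights=False)[0]


def make_stage(dim, depth, hybrid):
    """High-resolution stages use conv blocks; low-resolution stages use
    mixer blocks for the first half and self-attention blocks for the
    second half, reflecting the finding that attention at the end of a
    stage helps capture long-range spatial dependencies."""
    if not hybrid:
        return nn.Sequential(*[ConvBlock(dim) for _ in range(depth)])
    blocks = [MixerBlock(dim) for _ in range(depth // 2)]
    blocks += [AttnBlock(dim) for _ in range(depth - depth // 2)]
    return nn.Sequential(*blocks)


if __name__ == "__main__":
    stage3 = make_stage(dim=256, depth=8, hybrid=True)   # 4 mixer + 4 attention blocks
    tokens = torch.randn(2, 14 * 14, 256)                # (batch, tokens, channels)
    print(stage3(tokens).shape)                          # same shape as the input
```

Under these assumptions, the usage example shows that a hybrid stage keeps the token-sequence shape unchanged, so it can slot into a hierarchical backbone between downsampling layers.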