arXiv:2407.08083v1 [cs.CV] 10 Jul 2024 | Ali Hatamizadeh, Jan Kautz
MambaVision is a hybrid Mamba-Transformer backbone designed for vision tasks. The paper introduces a redesigned Mamba block that improves its ability to model visual features efficiently, and presents an ablation study on integrating Vision Transformer (ViT) components with Mamba, showing that adding self-attention blocks in the final layers significantly improves the model's ability to capture long-range spatial dependencies.

The MambaVision family uses a hierarchical architecture: CNN-based residual blocks handle fast feature extraction in the early stages, while the later stages combine Mamba and Transformer blocks for stronger modeling capacity. On ImageNet-1K, the models establish a new state-of-the-art (SOTA) Pareto front in terms of Top-1 accuracy versus image throughput, and they outperform comparably sized backbones on downstream tasks: object detection and instance segmentation on MS COCO, and semantic segmentation on ADE20K. The authors present MambaVision as the first hybrid architecture combining Mamba and Transformers for computer vision applications.

The main contributions are: (1) a redesigned Mamba block tailored to vision tasks; (2) a systematic investigation of integration patterns between Mamba and Transformer blocks; and (3) the MambaVision model family itself. Ablation studies show that both the token-mixer design and the hybrid integration pattern significantly affect performance, and that the models are especially effective in high-resolution settings.
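The summary above does not spell out what the redesigned Mamba block looks like. As a purely illustrative sketch (not the paper's implementation), the NumPy snippet below shows one plausible shape of such a token mixer: project in, run two symmetric branches (one with a conv plus a sequential state-space-style recurrence, one with a conv only), concatenate, and project out. The function names, the fixed smoothing kernel, and the exponential-moving-average stand-in for the selective scan are all assumptions made here for illustration.

```python
import numpy as np

def conv1d_same(x, kernel):
    # Depthwise-style 1-D convolution over the token axis with 'same' padding.
    L, d = x.shape
    k = len(kernel)
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for t in range(L):
        out[t] = sum(kernel[j] * xp[t + j] for j in range(k))
    return out

def selective_scan(x, decay=0.9):
    # Stand-in for the SSM recurrence: a simple exponential moving average
    # along the sequence dimension (NOT the actual selective scan).
    h = np.zeros(x.shape[-1])
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        h = decay * h + (1 - decay) * x[t]
        out[t] = h
    return out

def mambavision_mixer(x, W_in, W_out, kernel=(0.25, 0.5, 0.25)):
    # Hypothetical mixer layout: input projection, two symmetric branches
    # (conv + scan, and conv only), channel concatenation, output projection.
    z = x @ W_in                                # (L, 2 * d_inner)
    a, b = np.split(z, 2, axis=-1)
    a = selective_scan(conv1d_same(a, np.array(kernel)))  # SSM-style branch
    b = conv1d_same(b, np.array(kernel))                  # non-SSM branch
    return np.concatenate([a, b], axis=-1) @ W_out        # (L, d)
```

The two-branch layout is the key design idea being illustrated: part of the representation bypasses the sequential recurrence entirely, which suits spatial (non-causal) visual tokens better than a strictly autoregressive mixer.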
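The integration finding (self-attention in the final layers of a stage) can be sketched as a stage whose first half of blocks use a sequential Mamba-style mixer and whose second half use self-attention. The toy NumPy code below is a hedged illustration of that layout only; the exponential moving average stands in for the SSM recurrence, and the random projection weights are placeholders, not trained parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, W_qkv, W_out):
    # Single-head self-attention over the token sequence: (L, d) -> (L, d).
    q, k, v = np.split(x @ W_qkv, 3, axis=-1)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return (attn @ v) @ W_out

def ssm_mixer(x, decay=0.9):
    # Stand-in for a Mamba-style sequential token mixer: an exponential
    # moving average along the sequence dimension.
    h = np.zeros(x.shape[-1])
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        h = decay * h + (1 - decay) * x[t]
        out[t] = h
    return out

def hybrid_stage(x, n_blocks, rng):
    # First half: Mamba-style mixer blocks; second half: self-attention
    # blocks, mirroring the finding that attention in the final layers
    # best recovers long-range spatial dependencies.
    d = x.shape[-1]
    for i in range(n_blocks):
        if i < n_blocks // 2:
            x = x + ssm_mixer(x)                     # residual SSM-style block
        else:
            W_qkv = rng.standard_normal((d, 3 * d)) * 0.02
            W_out = rng.standard_normal((d, d)) * 0.02
            x = x + self_attention(x, W_qkv, W_out)  # residual attention block
    return x
```

The ordering matters: the sequential mixers build up local context cheaply first, and the closing attention blocks then mix all token pairs globally, which is the integration pattern the ablations favor.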