This paper introduces Mamba$^{\circledR}$, an enhanced version of Vision Mamba (Vim) that addresses artifacts in its feature maps. These artifacts, characterized by high-norm tokens appearing in low-information background regions, are more severe in Vision Mamba than in Vision Transformers (ViTs). To mitigate them, Mamba$^{\circledR}$ inserts register tokens into the input sequence, distributing them evenly among the patch tokens and reusing their outputs for final predictions. This design improves feature map quality, focusing the model on semantically meaningful regions and enhancing performance.
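To make the register mechanism concrete, here is a minimal PyTorch sketch of evenly interleaving learnable register tokens into a patch-token sequence. The function name `interleave_registers` and the chunk-based placement are illustrative assumptions, not taken from the paper's released code.

```python
import torch


def interleave_registers(patches: torch.Tensor, registers: torch.Tensor) -> torch.Tensor:
    """Insert register tokens at evenly spaced positions in the patch sequence.

    patches:   (B, N, D) patch-token sequence
    registers: (R, D) learnable register tokens, shared across the batch
    returns:   (B, N + R, D) sequence with one register ahead of each chunk
    """
    B, _, D = patches.shape
    R = registers.shape[0]
    # Split the patches into R roughly equal chunks and prepend one
    # register token to each chunk, spreading the registers uniformly.
    pieces = []
    for i, chunk in enumerate(torch.chunk(patches, R, dim=1)):
        reg = registers[i].expand(B, 1, D)  # broadcast the i-th register over the batch
        pieces.append(torch.cat([reg, chunk], dim=1))
    return torch.cat(pieces, dim=1)


# Example (hypothetical sizes): 12 registers over a 14x14 = 196 patch grid.
# registers = torch.nn.Parameter(torch.zeros(12, 768))
# tokens = interleave_registers(patch_embeddings, registers)  # (B, 208, 768)
```

With, say, 12 registers and 196 patches, one register would precede roughly every 17 patch tokens, which matches the paper's idea of spreading registers evenly rather than stacking them at the front of the sequence.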
The paper evaluates Mamba$^{\circledR}$ on the ImageNet and ADE20K benchmarks. On ImageNet, Mamba$^{\circledR}$-B achieves 82.9% top-1 accuracy, outperforming Vim-B's 81.8%, and the architecture scales well beyond that: a larger 341M-parameter variant reaches 83.2%. On ADE20K semantic segmentation, Mamba$^{\circledR}$-B achieves 49.1% mIoU, surpassing Vim's 44.9%.
The paper also presents ablation studies showing that both design choices matter: distributing the registers evenly through the sequence and reusing them for the final prediction each contribute significantly to performance. The results demonstrate that Mamba$^{\circledR}$ not only reduces artifacts but also strengthens the model's ability to capture global features, yielding cleaner and more effective feature maps. The architecture is efficient, scalable, and effective for vision tasks, offering a solid foundation for future research on optimizing Mamba architectures.
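One plausible way to realize the "reused for final predictions" step is sketched below: gather the register outputs after the backbone runs, concatenate them, and fuse them with a linear layer before the classification head. The class name `RegisterHead`, the `register_idx` argument, and the fusion details are assumptions for illustration, not necessarily the paper's exact implementation.

```python
import torch
import torch.nn as nn


class RegisterHead(nn.Module):
    """Classification head that recycles register outputs (illustrative sketch)."""

    def __init__(self, dim: int, num_registers: int, num_classes: int):
        super().__init__()
        # Fuse the R register outputs into one dim-sized vector, then
        # classify, rather than relying on a single class token.
        self.fuse = nn.Linear(num_registers * dim, dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, tokens: torch.Tensor, register_idx: torch.Tensor) -> torch.Tensor:
        # tokens:       (B, N + R, D) backbone output sequence
        # register_idx: (R,) positions where the registers were inserted
        regs = tokens[:, register_idx, :]    # (B, R, D) register outputs
        fused = self.fuse(regs.flatten(1))   # (B, D) fused representation
        return self.classifier(fused)        # (B, num_classes) logits


# Example (hypothetical sizes):
# head = RegisterHead(dim=768, num_registers=12, num_classes=1000)
```

The design choice this illustrates is that the registers, having absorbed global information (and artifacts) along the sequence, carry a useful summary of the image, so pooling them can replace or augment a lone class token at prediction time.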