This paper introduces Mamba®, an enhanced version of the Vision Mamba (Vim) architecture that addresses artifacts in Vim's feature maps. These artifacts, high-norm tokens that appear in low-information background regions of images, are more severe in Vision Mamba than in Vision Transformers and impair the model's ability to focus on semantically meaningful content. To mitigate the issue, the authors introduce register tokens into the architecture: the tokens are inserted evenly throughout the input sequence and are reused for the final prediction, yielding a more efficient and effective model.
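To make the insertion scheme concrete, here is a minimal PyTorch sketch of evenly interleaving learnable register tokens among the patch tokens before they enter the Mamba blocks. All names and shapes are hypothetical illustrations of the idea, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class RegisterInterleaver(nn.Module):
    """Evenly interleave learnable register tokens among patch tokens.

    Hypothetical sketch of the mechanism described above; module and
    argument names are illustrative, not taken from the paper's code.
    """

    def __init__(self, dim: int, num_registers: int = 12):
        super().__init__()
        self.registers = nn.Parameter(torch.zeros(num_registers, dim))
        nn.init.trunc_normal_(self.registers, std=0.02)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, dim)
        b, _, d = patch_tokens.shape
        r = self.registers.shape[0]
        # Split the patch sequence into r roughly equal chunks and
        # prepend one register token to each chunk, so the registers
        # end up evenly spaced along the 1D scan order.
        pieces = []
        for i, chunk in enumerate(patch_tokens.chunk(r, dim=1)):
            reg = self.registers[i].expand(b, 1, d)
            pieces.append(torch.cat([reg, chunk], dim=1))
        return torch.cat(pieces, dim=1)  # (batch, num_patches + r, dim)
```

Because a state-space model processes tokens sequentially, spacing the registers along the scan order (rather than stacking them at one end, as in ViT registers) keeps each register close to a different part of the image.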
The new architecture demonstrates significant improvements over the original Vision Mamba. Qualitatively, Mamba® produces cleaner feature maps that focus more sharply on semantically relevant regions. Quantitatively, it achieves higher accuracy on ImageNet classification: Mamba®-B reaches 82.9%, outperforming Vim-B's 81.8%. Mamba® also scales successfully to larger model sizes, achieving 83.2% accuracy with 341M parameters and improving further to 84.5% at larger input resolutions.
The paper further validates Mamba® on ADE20K semantic segmentation, where Mamba®-B achieves 49.1% mIoU, significantly outperforming Vim's 44.9%. Ablation studies examine the impact of the register tokens, showing that both distributing the registers evenly across the sequence and reusing them for the final prediction contribute substantially to the performance gains.
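The "reuse for final predictions" part could look roughly like the following sketch: the final-layer register tokens are gathered, concatenated, and passed to a linear classifier instead of a single class token. This is an assumed illustration of the idea, not the paper's exact head design.

```python
import torch
import torch.nn as nn


class RegisterHead(nn.Module):
    """Classification head that reuses register outputs.

    Hypothetical sketch: concatenates the final-layer register tokens
    and feeds them to a linear classifier, rather than relying on a
    single class token or mean-pooled patch features.
    """

    def __init__(self, dim: int, num_registers: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(dim * num_registers, num_classes)

    def forward(self, register_tokens: torch.Tensor) -> torch.Tensor:
        # register_tokens: (batch, num_registers, dim), gathered from the
        # positions where the registers were inserted into the sequence.
        return self.fc(register_tokens.flatten(1))
```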
In conclusion, Mamba® represents a significant advancement of the Vision Mamba architecture, offering improved performance, scalability, and effectiveness on visual tasks. The introduction of register tokens mitigates the feature-artifact issue, leading to more accurate and efficient visual representations.