Demystify Mamba in Vision: A Linear Attention Perspective


26 May 2024 | Dongchen Han, Ziyi Wang, Zhuofan Xia, Yizeng Han, Yifan Pu, Chunjiang Ge, Jun Song, Shiji Song, Bo Zheng, Gao Huang
This paper explores the relationship between Mamba, a state space model with linear computation complexity, and linear attention Transformers, which are known to underperform conventional Transformers. The authors reveal that Mamba shares surprising similarities with linear attention Transformers and identify six key distinctions: the input gate, the forget gate, the shortcut, the absence of attention normalization, single-head design, and a modified block design. Through empirical analysis, they find that the forget gate and the block design are the core contributors to Mamba's success, while the other designs have marginal impact or even hinder performance. Based on these findings, the authors propose the Mamba-Like Linear Attention (MLLA) model, which integrates the beneficial aspects of Mamba into linear attention Transformers. MLLA outperforms various vision Mamba models on image classification and high-resolution dense prediction tasks, while maintaining parallelizable computation and fast inference speed. The paper provides a comprehensive analysis of the key factors behind Mamba's effectiveness and offers a practical recipe for improving linear attention Transformers in vision tasks.
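To make the correspondence concrete, the following is a minimal sketch (not the authors' code) contrasting the recurrent form of normalized causal linear attention with a Mamba-like gated recurrence. The function names and the exact gating shapes are illustrative assumptions; Mamba's actual discretized SSM parameterization and hardware-aware scan are not reproduced here.

```python
# Sketch: linear attention step vs. a Mamba-like gated step (NumPy, hypothetical API).
# q, k, v are per-token vectors of dimension d; S is a d x d hidden state.
import numpy as np

def linear_attention_step(S, z, q, k, v):
    """One recurrent step of normalized causal linear attention.
    S accumulates outer products k v^T; z accumulates keys for the normalizer."""
    S = S + np.outer(k, v)            # S_i = S_{i-1} + k_i v_i^T
    z = z + k                         # z_i = z_{i-1} + k_i
    y = (q @ S) / (q @ z + 1e-6)      # y_i = q_i^T S_i / (q_i^T z_i)
    return S, z, y

def mamba_like_step(S, q, k, v, forget, inp):
    """One step of a Mamba-like linear recurrence, reflecting the distinctions
    discussed in the paper: an input-dependent forget gate decays the state,
    an input gate scales the value, and attention normalization is dropped."""
    S = forget[:, None] * S + np.outer(k, inp * v)   # S_i = A_i * S_{i-1} + k_i (g_i * v_i)^T
    y = q @ S                                        # y_i = q_i^T S_i  (no normalizer)
    return S, y
```

Iterating either step over a token sequence keeps cost linear in sequence length; the sketch is only meant to show that removing the normalizer and adding the forget gate is structurally what separates the two formulations in this analysis.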