26 May 2024 | Dongchen Han, Ziyi Wang, Zhuofan Xia, Yizeng Han, Yifan Pu, Chunjiang Ge, Jun Song, Shiji Song, Bo Zheng, Gao Huang
This paper explores the relationship between Mamba, a state space model with linear computational complexity, and the linear attention Transformer. It reveals that Mamba is surprisingly similar to linear attention, which typically underperforms conventional Transformers. By analyzing their similarities and differences, the paper identifies six key design distinctions in Mamba: the input gate, the forget gate, the shortcut, the absence of attention normalization, the single-head design, and the modified block design. The study shows that the forget gate and the block design are the core contributors to Mamba's success, while the other four designs have less impact. Based on these findings, the paper proposes Mamba-Like Linear Attention (MLLA), a model that incorporates the merits of the forget gate and the block design into linear attention while maintaining parallelizable computation and fast inference. MLLA outperforms various vision Mamba models on both image classification and high-resolution dense prediction, achieving superior results on ImageNet-1K classification, COCO object detection, and ADE20K semantic segmentation. The paper concludes that Mamba's success is largely due to these two design elements, and that linear attention can surpass Mamba once it is enhanced with them.
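To make the forget-gate distinction concrete, the following is a minimal sketch, not the authors' code: it contrasts the recurrent view of plain linear attention, which accumulates key-value memory without decay, with a Mamba-like variant whose running state is multiplied by an input-dependent forget gate at each step. The shapes, the sigmoid gate parameterization, and the function names are illustrative assumptions; attention normalization is omitted, mirroring the paper's observation that Mamba drops it.

```python
# Illustrative sketch (assumed, not from the paper's implementation):
# recurrent linear attention with and without a forget gate.
import numpy as np

def linear_attention(Q, K, V):
    """Plain linear attention as a recurrence: S_i = S_{i-1} + k_i^T v_i."""
    L, d = Q.shape
    S = np.zeros((d, V.shape[1]))          # running key-value memory
    out = np.zeros_like(V)
    for i in range(L):
        S = S + np.outer(K[i], V[i])       # accumulate, never forget
        out[i] = Q[i] @ S
    return out

def gated_linear_attention(Q, K, V, forget):
    """Mamba-like recurrence: S_i = g_i * S_{i-1} + k_i^T v_i,
    where g_i in (0, 1) is an input-dependent forget gate."""
    L, d = Q.shape
    S = np.zeros((d, V.shape[1]))
    out = np.zeros_like(V)
    for i in range(L):
        S = forget[i] * S + np.outer(K[i], V[i])   # decay stale memory
        out[i] = Q[i] @ S
    return out

# Toy usage with random tokens and a per-token sigmoid gate.
L, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, L, d))
g = 1.0 / (1.0 + np.exp(-rng.standard_normal(L)))  # gate values in (0, 1)
y_plain = linear_attention(Q, K, V)
y_gated = gated_linear_attention(Q, K, V, g)
```

In this view, plain linear attention is the special case where the gate is fixed at 1; letting the gate depend on the input allows old context to decay, which is one of the two designs the paper credits for Mamba's strong performance. Note that the actual MLLA model keeps computation parallelizable rather than running this sequential recurrence.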