26 Apr 2024 | Hanwei Zhang, Ying Zhu, Dan Wang, Lijun Zhang, Tianxiang Chen, and Zi Ye
This paper presents a comprehensive survey of Mamba models in the field of computer vision. Mamba is a state space model (SSM) with a selection mechanism and a hardware-aware architecture that has shown significant promise in long-sequence modeling. Unlike transformers, whose attention mechanism has quadratic complexity in sequence length, Mamba handles long sequences with linear complexity and is particularly effective at processing long videos at high resolution. The paper begins by exploring the foundational concepts behind Mamba's success: the state space model framework, selection mechanisms, and hardware-aware design. It then reviews vision Mamba models, categorizing them into foundational models and models enhanced with techniques such as convolution, recurrence, and attention. The paper further surveys the applications of Mamba in vision tasks, including its use as a backbone at various levels of vision processing. These cover general visual tasks, medical visual tasks (e.g., 2D/3D segmentation, classification, and image registration), and remote sensing visual tasks. General visual tasks are presented at two levels: high/mid-level vision (e.g., object detection, segmentation, video classification) and low-level vision (e.g., image super-resolution, image restoration, visual generation). The paper concludes by discussing the potential of Mamba in various sequential data processing tasks and its ability to address current challenges in computer vision. The key contributions of this survey are a comprehensive review of the Mamba technique in the vision domain, an expansion upon the naive Mamba-based visual framework, and an in-depth exploration that organizes the literature by application task.
The paper also discusses the application of Mamba technologies in addressing various computer vision tasks and concludes with a summary of the key findings and implications of Mamba in the field of computer vision.
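As a rough illustration of the linear-complexity claim in the summary, the following is a minimal sketch of a discretized diagonal state space recurrence, h_t = a_t * h_(t-1) + b_t * x_t with output y_t = c_t . h_t. It is not the Mamba implementation and is not taken from the survey: the function name `selective_scan`, the argument layout, and the toy values are assumptions chosen for illustration. In a selective SSM the per-step parameters would be computed from the input; here they are simply passed in.

```python
def selective_scan(x, a, b, c):
    """Run a diagonal state space recurrence over a 1-D sequence.

    x: list of L scalar inputs.
    a, b, c: lists of L rows, each row holding N per-step parameters
             (hypothetical layout; a selective SSM would derive them from x).
    Returns a list of L scalar outputs.

    The loop visits each step once, so the cost is O(L * N): linear in
    sequence length, unlike the O(L^2) pairwise scores of full attention.
    """
    n_state = len(a[0])
    h = [0.0] * n_state  # hidden state, carried from step to step
    y = []
    for t, x_t in enumerate(x):
        # State update: decay the previous state and inject the current input.
        h = [a[t][i] * h[i] + b[t][i] * x_t for i in range(n_state)]
        # Readout: project the state to a scalar output.
        y.append(sum(c[t][i] * h[i] for i in range(n_state)))
    return y


if __name__ == "__main__":
    # Toy example: one state dimension, constant decay of 0.5, unit impulse input.
    steps = 3
    out = selective_scan(
        [1.0, 0.0, 0.0],
        [[0.5]] * steps,
        [[1.0]] * steps,
        [[1.0]] * steps,
    )
    print(out)
```

The fixed-size state `h` is what lets memory stay constant as the sequence grows; practical implementations replace this Python loop with a parallel, hardware-aware scan.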