RSMamba: Remote Sensing Image Classification with State Space Model

28 Mar 2024 | Keyan Chen¹, Bowen Chen¹, Chenyang Liu¹, Wenyuan Li², Zhengxia Zou¹, Zhenwei Shi¹,*
RSMamba is a novel architecture for remote sensing image classification based on the State Space Model (SSM), incorporating the efficient, hardware-aware design known as Mamba. It combines the advantages of a global receptive field with linear modeling complexity. To overcome the limitation of the vanilla Mamba, which can only model causal sequences and is therefore ill-suited to two-dimensional image data, we propose a dynamic multi-path activation mechanism that augments Mamba's capacity to model non-causal data. Notably, RSMamba preserves the inherent modeling mechanism of the vanilla Mamba, yet achieves superior performance across multiple remote sensing image classification datasets. This indicates that RSMamba holds significant potential as the backbone of future visual foundation models.

RSMamba transforms 2-D images into 1-D sequences and captures long-distance dependencies with the Multi-Path SSM Encoder. Given an image I, we employ a 2-D convolution with a kernel of k and a stride of s to map local patches into pixel-wise feature embeddings, and the resulting feature map is flattened into a 1-D sequence. To preserve the relative spatial position relationships within the image, we add a position encoding P. The entire process is: T = Φ_Flatten(Φ_Conv2D(I, k, s)), T = T + P.

Unlike ViT, RSMamba does not use a [CLS] token to aggregate a global representation. Instead, the sequence is fed into N stacked dynamic multi-path activation Mamba blocks for long-distance dependency modeling, and the dense features needed for category prediction are obtained by applying mean pooling to the sequence. This procedure can be iteratively expressed as: T^i = Φ^i_mp-ssm(T^{i-1}) + T^{i-1}, s = Φ_proj(Φ_LN(Φ_mean(T^N))). The dynamic multi-path activation mechanism is introduced to augment Mamba's capacity for 2-D data.
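As a rough sketch (not the authors' implementation), the tokenization and the mean-pooling classification head can be illustrated with NumPy. The random `W_proj`, `P`, and `W_cls` matrices are hypothetical stand-ins for learned parameters, and the sketch assumes non-overlapping patches (k = s), in which case the strided convolution reduces to a linear projection of each patch:

```python
import numpy as np

def patch_embed(image, k, s, dim, rng):
    # image: (H, W, C). With k == s, Conv2D(I, k, s) is a linear map per patch.
    H, W, C = image.shape
    assert k == s, "sketch assumes non-overlapping patches"
    nH, nW = H // k, W // k
    # cut the image into (nH * nW) patches of shape (k, k, C) and flatten each
    patches = image[:nH * k, :nW * k].reshape(nH, k, nW, k, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(nH * nW, k * k * C)
    W_proj = rng.standard_normal((k * k * C, dim)) * 0.02  # hypothetical weights
    T = patches @ W_proj                       # T = Flatten(Conv2D(I, k, s))
    P = rng.standard_normal((nH * nW, dim)) * 0.02  # position encoding (random here)
    return T + P                               # T = T + P

def classify(T, W_cls):
    # s = proj(LN(mean(T^N))): mean pooling replaces the [CLS] token
    pooled = T.mean(axis=0)
    normed = (pooled - pooled.mean()) / (pooled.std() + 1e-6)  # LayerNorm sketch
    return normed @ W_cls                      # Φ_proj to class logits
```

For a 224×224×3 image with k = s = 16 this yields a sequence of 196 tokens, matching the ViT-style tokenization the text describes.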
Importantly, to preserve the structure of the vanilla Mamba block, this mechanism operates exclusively on the block's input and output. Specifically, we duplicate the input sequence into three copies to establish three different paths, namely a forward path, a reverse path, and a random-shuffle path, and leverage a plain Mamba mixer with shared parameters to model the dependency relationships among tokens within each of the three sequences. Subsequently, we revert all tokens in the sequences to their original order and employ a linear layer to condense the sequence information, thereby establishing a gate over the three paths. This gate is then used to activate the representations of the three information flows, as shown in Fig. 1. RSMamba builds on the original Mamba but introduces the dynamic multi-path activation mechanism to alleviate the limitations of the plain Mamba, which can only model in a single direction and is position-agnostic.
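The three-path scheme above can be sketched as follows. This is an illustrative reading, not the paper's code: `ssm_mixer` is only a causal cumulative-average placeholder for the real selective-SSM mixer, and the gating details (mean-condensed logits through a hypothetical `Wg`, softmax over the three paths) are assumptions:

```python
import numpy as np

def ssm_mixer(x):
    # placeholder for the shared-parameter Mamba mixer: a causal cumulative
    # average, so each token only sees tokens before it in its path's order
    return np.cumsum(x, axis=0) / np.arange(1, x.shape[0] + 1)[:, None]

def multi_path_block(T, rng):
    L, D = T.shape
    perm = rng.permutation(L)              # random-shuffle path order
    inv = np.argsort(perm)                 # permutation that undoes the shuffle
    paths = [T, T[::-1], T[perm]]          # forward, reverse, shuffled copies
    outs = [ssm_mixer(p) for p in paths]   # same mixer applied to every path
    # revert all tokens in each path to the original sequence order
    outs = [outs[0], outs[1][::-1], outs[2][inv]]
    # linear layer condenses sequence information into a gate over the 3 paths
    Wg = rng.standard_normal((D, 3)) * 0.02          # hypothetical gate weights
    logits = T.mean(axis=0) @ Wg
    g = np.exp(logits) / np.exp(logits).sum()        # softmax gate activations
    fused = sum(g[i] * outs[i] for i in range(3))    # activate the 3 flows
    return T + fused                       # residual: T^i = mp-ssm(T^{i-1}) + T^{i-1}
```

Because each path is mixed causally but in a different token order, gating and summing the three restored outputs gives every token access to context from both directions plus a position-agnostic shuffled view, which is the stated purpose of the mechanism.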