Understanding RSCaMa: Remote Sensing Image Change Captioning With State Space Model

The paper introduces RSCaMa, a novel model for Remote Sensing Image Change Captioning (RSICC), the task of describing surface changes between multi-temporal remote sensing images in natural language. RSICC requires both spatial and temporal modeling of bi-temporal features, but previous methods have focused primarily on spatial change perception. To address this, RSCaMa stacks multiple CaMa layers to achieve efficient joint spatial-temporal modeling. The model brings the Mamba state space model (SSM), which has a global receptive field and linear complexity, into the RSICC task. Specifically, it proposes the Spatial Difference-aware SSM (SD-SSM) and the Temporal-Traversing SSM (TT-SSM). SD-SSM uses differential features to sharpen spatial change perception, while TT-SSM scans the bi-temporal features in a temporal cross-wise manner to strengthen temporal understanding. Experiments on the LEVIR-CC dataset demonstrate the effectiveness of RSCaMa's joint spatial-temporal modeling and highlight the potential of Mamba in RSICC. The paper also compares three language decoders (Mamba, a GPT-style decoder, and a Transformer decoder), offering guidance for future research.
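The two ideas summarized above, differential features and a temporal cross-wise scan, can be illustrated with a small toy sketch. This is not the authors' implementation: the function names (`differential_features`, `temporal_cross_scan`, `linear_scan`) are hypothetical, token-by-token interleaving is one assumed reading of "temporal cross-wise", and the fixed-decay recurrence is only a stand-in for Mamba's learned, input-dependent SSM.

```python
from typing import List

Token = List[float]  # one feature vector (one spatial position)


def differential_features(feat_t1: List[Token], feat_t2: List[Token]) -> List[Token]:
    """Element-wise difference (t2 - t1) at every token position.

    Assumption: a plain subtraction stands in for the 'differential
    features' that SD-SSM is described as using.
    """
    return [[b - a for a, b in zip(tok1, tok2)]
            for tok1, tok2 in zip(feat_t1, feat_t2)]


def temporal_cross_scan(feat_t1: List[Token], feat_t2: List[Token]) -> List[Token]:
    """Interleave the two token sequences: t1[0], t2[0], t1[1], t2[1], ...

    Assumption: this alternating order is one way to scan bi-temporal
    features 'in a temporal cross-wise manner'.
    """
    if len(feat_t1) != len(feat_t2):
        raise ValueError("both images must contribute the same number of tokens")
    interleaved: List[Token] = []
    for tok1, tok2 in zip(feat_t1, feat_t2):
        interleaved.append(tok1)
        interleaved.append(tok2)
    return interleaved


def linear_scan(tokens: List[Token], decay: float = 0.5) -> List[Token]:
    """Toy recurrence h_k = decay * h_{k-1} + x_k over the token sequence.

    One pass over the sequence, so cost grows linearly with its length.
    A real Mamba block uses learned, input-dependent parameters instead
    of a fixed scalar decay.
    """
    if not tokens:
        return []
    state = [0.0] * len(tokens[0])
    outputs: List[Token] = []
    for tok in tokens:
        state = [decay * s + x for s, x in zip(state, tok)]
        outputs.append(list(state))
    return outputs


if __name__ == "__main__":
    before = [[0.0, 1.0], [2.0, 3.0]]      # tokens from the earlier image
    after = [[10.0, 11.0], [12.0, 13.0]]   # tokens from the later image
    print(differential_features(before, after))
    print(linear_scan(temporal_cross_scan(before, after)))
```

Because the interleaved sequence places each position's "before" and "after" tokens next to each other, the recurrent state at every step mixes information from both dates, which is the intuition behind scanning across time rather than processing each image separately.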

RSCaMa: Remote Sensing Image Change Captioning with State Space Model

2024 | Chenyang Liu, Keyan Chen, Bowen Chen, Haotian Zhang, Zhengxia Zou, Member, IEEE, and Zhenwei Shi*, Senior Member, IEEE