This paper proposes RSCaMa, a novel model for Remote Sensing Image Change Captioning (RSICC), which aims to describe, in natural language, the changes between multi-temporal remote sensing images. RSICC involves identifying and describing changes in surface features, including object categories, locations, and dynamics. Previous methods have focused on spatial change perception but are limited in joint spatial-temporal modeling. To address this, RSCaMa introduces a state space model (SSM), specifically Mamba, which offers a global receptive field with linear complexity. The model employs two key components: a Spatial Difference-aware SSM (SD-SSM) for spatial change perception and a Temporal-Traversing SSM (TT-SSM) for temporal modeling. SD-SSM enhances spatial change perception by injecting differential features, while TT-SSM improves temporal interaction by cross-scanning the bi-temporal features. The architecture consists of a backbone for feature extraction, multiple stacked CaMa layers for joint spatial-temporal modeling, and a language decoder for caption generation. Experiments show that RSCaMa outperforms existing methods in captioning accuracy, with significant improvements in key metrics such as BLEU-4 and S_m*, demonstrating the potential of Mamba for RSICC tasks. Additionally, the paper compares three language decoders, namely Mamba, a GPT-style decoder, and a Transformer decoder, providing insights for future research. The code is available at https://github.com/Chen-Yang-Liu/RSCaMa.
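To make the described data flow concrete, the sketch below illustrates how one CaMa layer could combine an SD-SSM step (difference-aware spatial mixing) with a TT-SSM step (interleaved cross-scanning of bi-temporal tokens). This is a minimal, hypothetical rendering of the idea, not the authors' implementation: the module names, shapes, and fusion details are assumptions, and a simple recurrent block stands in for a real Mamba/state-space block (which, e.g., the mamba_ssm package would provide).

```python
# Hypothetical sketch of one CaMa layer (SD-SSM + TT-SSM); shapes, fusion, and
# module names are assumptions, not the paper's released code.
import torch
import torch.nn as nn


class SSMBlock(nn.Module):
    """Stand-in for a Mamba/state-space block over a token sequence [B, N, D]."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # A causal recurrent mixer used purely as a placeholder for the selective scan.
        self.mix = nn.GRU(dim, dim, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.mix(self.norm(x))
        return x + out  # residual connection


class CaMaLayer(nn.Module):
    """Schematic joint spatial-temporal modeling layer with SD-SSM and TT-SSM."""
    def __init__(self, dim: int):
        super().__init__()
        self.sd_ssm = SSMBlock(dim)        # Spatial Difference-aware SSM
        self.tt_ssm = SSMBlock(dim)        # Temporal-Traversing SSM
        self.diff_proj = nn.Linear(dim, dim)

    def forward(self, f1: torch.Tensor, f2: torch.Tensor):
        # f1, f2: bi-temporal token features from the backbone, each [B, N, D].
        # SD-SSM: inject differential features to sharpen spatial change perception.
        diff = self.diff_proj(f2 - f1)
        f1 = self.sd_ssm(f1 + diff)
        f2 = self.sd_ssm(f2 + diff)

        # TT-SSM: cross-scan by interleaving bi-temporal tokens into one sequence,
        # scanning it, then splitting back, so each step attends across both times.
        b, n, d = f1.shape
        interleaved = torch.stack([f1, f2], dim=2).reshape(b, 2 * n, d)
        interleaved = self.tt_ssm(interleaved)
        f1, f2 = interleaved.reshape(b, n, 2, d).unbind(dim=2)
        return f1, f2


if __name__ == "__main__":
    layer = CaMaLayer(dim=256)
    t1 = torch.randn(2, 49, 256)  # e.g. 7x7 backbone tokens for the image at time 1
    t2 = torch.randn(2, 49, 256)  # tokens for the image at time 2
    o1, o2 = layer(t1, t2)
    print(o1.shape, o2.shape)     # torch.Size([2, 49, 256]) for both outputs
```

In the full model, several such layers would be stacked, and the refined bi-temporal tokens would condition a language decoder (Mamba, GPT-style, or Transformer) that generates the change caption.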