Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation

12 Sep 2024 | Zifu Wan, Pingping Zhang, Yuhao Wang, Silong Yong, Simon Stepputtis, Katia Sycara, Yaqi Xie
Sigma is a novel Siamese Mamba network designed for multi-modal semantic segmentation. It leverages the Mamba architecture to achieve global receptive fields with linear complexity, addressing the limitations of traditional methods such as CNNs, which have limited receptive fields, and Vision Transformers (ViTs), which incur quadratic complexity.

The network employs a Siamese encoder to extract modality-specific features and a fusion module to combine information from different modalities; a channel-aware decoder then enhances the model's ability to process spatial and channel-wise information. The encoder uses cascaded Visual State Space Blocks (VSSBs) to extract multi-scale features, while the fusion module uses a Cross Mamba Block (CroMB) and a Concat Mamba Block (ConMB) to effectively select and combine information from different modalities. The decoder further refines the fused features to produce accurate semantic segmentation results.

Extensive experiments on RGB-Thermal and RGB-Depth datasets demonstrate that Sigma outperforms state-of-the-art models in both accuracy and efficiency, marking the first successful application of State Space Models (SSMs) to multi-modal perception tasks. The model's effectiveness is validated through quantitative and qualitative comparisons, showing its ability to handle complex scenes and extract the features vital to improved segmentation.
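The linear-complexity claim follows from the state space formulation that Mamba builds on. For reference, here is the standard discretized SSM recurrence (textbook S4/Mamba notation, not equations taken from this summary):

```latex
\[
h_k = \bar{A}\, h_{k-1} + \bar{B}\, x_k, \qquad y_k = C\, h_k,
\]
\[
\bar{A} = \exp(\Delta A), \qquad
\bar{B} = (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\, \Delta B,
\]
```

where $\Delta$ is the discretization step size. Because each hidden state $h_k$ depends only on its predecessor, a sequence of $L$ tokens is processed in $O(L)$ time while still letting information propagate across the whole sequence, whereas self-attention compares every token pair and costs $O(L^2)$.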
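To make the two-branch layout concrete, below is a minimal PyTorch sketch of the overall data flow: a Siamese encoder applied to both modalities, a fusion step, and a decoder head. All names here (ToyVSSBlock, SiameseStage, SigmaSketch) are hypothetical, the depthwise convolution is only a shape-preserving stand-in for the real 2D selective scan, and the concat-and-project fusion is a crude stand-in for CroMB/ConMB; weight sharing between branches and the 3-channel auxiliary input are also assumptions, not details confirmed by this summary.

```python
# Minimal architectural sketch, assuming PyTorch. Not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVSSBlock(nn.Module):
    """Stand-in for a Visual State Space Block (VSSB).
    The real block runs a Mamba selective scan over the 2D feature map;
    here a depthwise convolution merely keeps tensor shapes consistent."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mix = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x):  # x: (B, C, H, W)
        residual = x
        x = self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        return residual + self.proj(F.silu(self.mix(x)))

class SiameseStage(nn.Module):
    """One encoder stage: downsample, then a VSS block. The same weights
    process both modalities (Siamese sharing, assumed here)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.down = nn.Conv2d(in_dim, out_dim, kernel_size=2, stride=2)
        self.block = ToyVSSBlock(out_dim)

    def forward(self, x):
        return self.block(self.down(x))

class SigmaSketch(nn.Module):
    def __init__(self, num_classes=9, dims=(32, 64, 128, 256)):
        super().__init__()
        self.stem = nn.Conv2d(3, dims[0], kernel_size=4, stride=4)
        self.stages = nn.ModuleList(
            SiameseStage(dims[i], dims[i + 1]) for i in range(len(dims) - 1)
        )
        # Stand-in for CroMB/ConMB fusion: concatenate and project.
        self.fuse = nn.Conv2d(2 * dims[-1], dims[-1], kernel_size=1)
        self.head = nn.Conv2d(dims[-1], num_classes, kernel_size=1)

    def encode(self, x):
        x = self.stem(x)
        for stage in self.stages:
            x = stage(x)
        return x

    def forward(self, rgb, aux):  # aux: thermal or depth, assumed 3-channel
        f_rgb, f_aux = self.encode(rgb), self.encode(aux)    # shared weights
        fused = self.fuse(torch.cat([f_rgb, f_aux], dim=1))  # fusion stand-in
        logits = self.head(fused)
        # Upsample back to input resolution for per-pixel predictions.
        return F.interpolate(logits, size=rgb.shape[-2:],
                             mode="bilinear", align_corners=False)

model = SigmaSketch(num_classes=9)
rgb = torch.randn(1, 3, 480, 640)
thermal = torch.randn(1, 3, 480, 640)
print(model(rgb, thermal).shape)  # torch.Size([1, 9, 480, 640])
```

The sketch only reproduces the high-level structure described above; the paper's actual fusion blocks use Mamba scans to select cross-modal information rather than a plain concatenation, and its decoder is channel-aware rather than a single 1x1 head.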