Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation

12 Sep 2024 | Zifu Wan, Pingping Zhang, Yuhao Wang, Silong Yong, Simon Stepputtis, Katia Sycara, Yaqi Xie
Sigma is a novel Siamese Mamba network designed for multi-modal semantic segmentation. It leverages the Mamba architecture to achieve global receptive fields with linear complexity, addressing the limitations of traditional methods such as CNNs, which have limited receptive fields, and Vision Transformers (ViTs), which incur quadratic complexity.

The network employs a Siamese encoder to extract modality-specific features and a fusion module to combine information from different modalities; a channel-aware decoder then enhances the model's ability to process spatial and channel-wise information. The encoder uses cascaded Visual State Space Blocks (VSSBs) to extract multi-scale features, while the fusion module uses a Cross Mamba Block (CroMB) and a Concat Mamba Block (ConMB) to effectively select and combine information from different modalities. The decoder further refines the fused features to produce accurate semantic segmentation results.

Extensive experiments on RGB-Thermal and RGB-Depth datasets demonstrate that Sigma outperforms state-of-the-art models in both accuracy and efficiency, marking the first successful application of State Space Models (SSMs) to multi-modal perception tasks. The model's effectiveness is validated through quantitative and qualitative comparisons, showing its ability to handle complex scenes and extract the features vital to improved segmentation.
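The linear-complexity claim follows from the state space formulation that Mamba builds on. For reference, here is the standard discretized SSM recurrence (textbook S4/Mamba notation, not equations taken from this summary):

```latex
\[
h_k = \bar{A}\, h_{k-1} + \bar{B}\, x_k, \qquad y_k = C\, h_k,
\]
\[
\bar{A} = \exp(\Delta A), \qquad
\bar{B} = (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\, \Delta B,
\]
```

where $\Delta$ is the discretization step size. Because each hidden state $h_k$ depends only on its predecessor, a sequence of $L$ tokens is processed in $O(L)$ time while still letting information propagate across the whole sequence, whereas self-attention compares every token pair and costs $O(L^2)$.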
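To make the two-branch layout concrete, below is a minimal PyTorch sketch of the overall data flow: a Siamese encoder applied to both modalities, a fusion step, and a decoder head. All names here (ToyVSSBlock, SiameseStage, SigmaSketch) are hypothetical, the depthwise convolution is only a shape-preserving stand-in for the real 2D selective scan, and the concat-and-project fusion is a crude stand-in for CroMB/ConMB; weight sharing between branches and the 3-channel auxiliary input are also assumptions, not details confirmed by this summary.

```python
# Minimal architectural sketch, assuming PyTorch. Not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVSSBlock(nn.Module):
    """Stand-in for a Visual State Space Block (VSSB).
    The real block runs a Mamba selective scan over the 2D feature map;
    here a depthwise convolution merely keeps tensor shapes consistent."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mix = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x):  # x: (B, C, H, W)
        residual = x
        x = self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        return residual + self.proj(F.silu(self.mix(x)))

class SiameseStage(nn.Module):
    """One encoder stage: downsample, then a VSS block. The same weights
    process both modalities (Siamese sharing, assumed here)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.down = nn.Conv2d(in_dim, out_dim, kernel_size=2, stride=2)
        self.block = ToyVSSBlock(out_dim)

    def forward(self, x):
        return self.block(self.down(x))

class SigmaSketch(nn.Module):
    def __init__(self, num_classes=9, dims=(32, 64, 128, 256)):
        super().__init__()
        self.stem = nn.Conv2d(3, dims[0], kernel_size=4, stride=4)
        self.stages = nn.ModuleList(
            SiameseStage(dims[i], dims[i + 1]) for i in range(len(dims) - 1)
        )
        # Stand-in for CroMB/ConMB fusion: concatenate and project.
        self.fuse = nn.Conv2d(2 * dims[-1], dims[-1], kernel_size=1)
        self.head = nn.Conv2d(dims[-1], num_classes, kernel_size=1)

    def encode(self, x):
        x = self.stem(x)
        for stage in self.stages:
            x = stage(x)
        return x

    def forward(self, rgb, aux):  # aux: thermal or depth, assumed 3-channel
        f_rgb, f_aux = self.encode(rgb), self.encode(aux)    # shared weights
        fused = self.fuse(torch.cat([f_rgb, f_aux], dim=1))  # fusion stand-in
        logits = self.head(fused)
        # Upsample back to input resolution for per-pixel predictions.
        return F.interpolate(logits, size=rgb.shape[-2:],
                             mode="bilinear", align_corners=False)

model = SigmaSketch(num_classes=9)
rgb = torch.randn(1, 3, 480, 640)
thermal = torch.randn(1, 3, 480, 640)
print(model(rgb, thermal).shape)  # torch.Size([1, 9, 480, 640])
```

The sketch only reproduces the high-level structure described above; the paper's actual fusion blocks use Mamba scans to select cross-modal information rather than a plain concatenation, and its decoder is channel-aware rather than a single 1x1 head.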