ReMamber: Referring Image Segmentation with Mamba Twister

25 Jul 2024 | Yuhuan Yang*, Chaofan Ma*, Jiangchao Yao¹, Zhun Zhong², Ya Zhang¹, and Yanfeng Wang¹
ReMamber is a novel referring image segmentation (RIS) architecture that integrates the Mamba framework with a multi-modal Mamba Twister block. The Mamba Twister explicitly models image-text interaction and fuses textual and visual features through its unique channel and spatial twisting mechanism. The architecture achieves competitive results on three challenging benchmarks with a simple and efficient design. The paper also presents thorough analyses of ReMamber and discusses other fusion designs using Mamba, providing valuable insights for future research. The code is available at https://github.com/yyh-rain-song/ReMamber.
The paper introduces ReMamber, a novel RIS architecture that leverages the Mamba framework to address the computational inefficiency of traditional transformers in capturing long-range visual-language dependencies. The Mamba Twister block explicitly models image-text interactions and fuses textual and visual features through a unique channel and spatial twisting mechanism. Each block consists of several visual state space (VSS) layers followed by a Twisting layer: the VSS layers extract visual features, while the Twisting layer injects textual information into the visual modality. The Twisting layer comprises three critical components: (1) a vision-language interaction operation that captures fine-grained interactions between the two modalities, (2) a hybrid feature cube created by concatenating visual, multimodal, and global textual features, and (3) a twisting mechanism that enhances interaction within and across modalities.

The paper also discusses alternative fusion designs built on Mamba, including In-context Conditioning, Attention-based Conditioning, and Norm Adaptation. All variants are evaluated on three benchmark datasets: RefCOCO, RefCOCO+, and G-Ref. The results show that the Mamba Twister consistently outperforms the other variants across all metrics and datasets, indicating a superior capability to capture and integrate contextual information for more accurate segmentation. Ablation studies on the combination of the two scans and on the effects of global and local interactions show that combining the channel scan with the spatial scan offers a considerable advantage.

The paper concludes that ReMamber is a significant advancement in multi-modal understanding, demonstrating the potential of the Mamba architecture to enhance the scalability and performance of multi-modal tasks. It also notes a limitation of the current design: the segmentation decoder consists of only a few convolutional layers. Future work will investigate more sophisticated multi-modal segmentation decoders that best fit the Mamba architecture.
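To make the three-step structure of the Twisting layer concrete, the following PyTorch sketch illustrates the flow described above: a vision-language interaction, construction of the hybrid feature cube, and channel/spatial mixing. This is a minimal illustration, not the authors' implementation: the class and method names (e.g. TwistingLayerSketch, channel_mix, spatial_mix) are hypothetical, cross-attention is assumed as the interaction operator, and the channel and spatial "twists" are replaced by simple stand-ins, whereas ReMamber itself uses Mamba selective-scan (SSM) layers (see the linked repository for the real code).

```python
# Minimal PyTorch sketch of the Twisting layer described above.
# Assumptions: cross-attention as the vision-language interaction; a linear
# layer and a depthwise convolution as stand-ins for the channel scan and
# spatial scan. Not the authors' implementation.
import torch
import torch.nn as nn


class TwistingLayerSketch(nn.Module):
    def __init__(self, dim: int, text_dim: int):
        super().__init__()
        # (1) vision-language interaction: project word features and let
        #     image patches attend to words (cross-attention is an assumption).
        self.text_proj = nn.Linear(text_dim, dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # (3) twisting stand-ins: mix along channels, then along space.
        self.channel_mix = nn.Sequential(nn.LayerNorm(3 * dim), nn.Linear(3 * dim, dim))
        self.spatial_mix = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, vis: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        # vis: (B, H, W, C) visual features from the preceding VSS layers
        # txt: (B, L, C_t) word-level text features
        B, H, W, C = vis.shape
        txt = self.text_proj(txt)                      # (B, L, C)
        vis_seq = vis.reshape(B, H * W, C)             # flatten spatial grid
        # (1) fine-grained image-text interaction -> multimodal features
        mm, _ = self.cross_attn(vis_seq, txt, txt)     # (B, HW, C)
        # global textual feature, broadcast to every spatial location
        g = txt.mean(dim=1, keepdim=True).expand(-1, H * W, -1)
        # (2) hybrid feature cube: visual || multimodal || global text
        cube = torch.cat([vis_seq, mm, g], dim=-1)     # (B, HW, 3C)
        # (3a) channel twist: mix information across the three feature groups
        fused = self.channel_mix(cube)                 # (B, HW, C)
        # (3b) spatial twist: mix information across spatial positions
        fused = fused.reshape(B, H, W, C).permute(0, 3, 1, 2)
        fused = self.spatial_mix(fused).permute(0, 2, 3, 1)
        # residual connection back into the visual stream
        return vis + fused


# Example: a 14x14 visual feature map fused with a 12-word expression.
if __name__ == "__main__":
    layer = TwistingLayerSketch(dim=96, text_dim=768)
    out = layer(torch.randn(2, 14, 14, 96), torch.randn(2, 12, 768))
    print(out.shape)  # torch.Size([2, 14, 14, 96])
```

The key design point the sketch preserves is that text enters the visual stream only through the hybrid cube, so the subsequent channel and spatial mixing (the "twists") are what propagate the referring expression across the feature map; in ReMamber these mixing steps are performed by Mamba scans rather than the placeholders used here.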