Fusion-Mamba for Cross-modal Object Detection


14 Apr 2024 | Wenhao Dong, Haodong Zhu, Shaohui Lin, Xiaoyan Luo, Yunhang Shen, Xuhui Liu, Juan Zhang, Guodong Guo, Baochang Zhang
This paper proposes Fusion-Mamba, a novel method for cross-modality object detection. It leverages an improved Mamba with a gating mechanism to map cross-modal features into a hidden state space for interaction, reducing the disparities between modalities and enhancing the representation consistency of the fused features. The core of the method is the Fusion-Mamba block (FMB), which contains two modules: a State Space Channel Swapping (SSCS) module for shallow feature fusion and a Dual State Space Fusion (DSSF) module for deep fusion in a hidden state space. This is the first work to explore the potential of Mamba for cross-modal fusion, and it establishes a new baseline for cross-modality object detection. Because the fusion operates with linear time complexity, the approach is more efficient than Transformer-based fusion while delivering a significant improvement in detection performance.

The method is implemented on a dual-stream feature extraction network with three FMBs, and the detection network uses a neck and head similar to YOLOv8. It is effective with both YOLOv5 and YOLOv8 backbones, with the YOLOv8 backbone achieving state-of-the-art performance.

Extensive experiments on three widely used visible-infrared benchmarks, LLVIP, M^3FD, and FLIR-Aligned, demonstrate state-of-the-art performance, outperforming existing methods on mAP by 5.9% on M^3FD and 4.9% on FLIR-Aligned. The results show that the FMB effectively reduces modality disparities and enhances the representation consistency of fused features, leading to better object detection performance.
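The summary above only names the FMB's two stages; the paper's exact layer definitions are not reproduced here. The following PyTorch sketch is a hypothetical, simplified illustration of how an FMB-style block could combine channel swapping between the RGB and IR streams (the SSCS idea) with a gated state-space fusion step in a shared hidden space (the DSSF idea). The module names, the half-channel split, and the toy sequential SSM recurrence are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn

class SimpleGatedSSM(nn.Module):
    """Toy gated state-space layer standing in for Mamba (an assumption, not the paper's kernel)."""
    def __init__(self, dim, state_dim=16):
        super().__init__()
        self.in_proj = nn.Linear(dim, dim)
        self.gate_proj = nn.Linear(dim, dim)
        self.A = nn.Parameter(torch.randn(dim, state_dim) * 0.01)
        self.B = nn.Parameter(torch.randn(dim, state_dim) * 0.01)
        self.C = nn.Parameter(torch.randn(dim, state_dim) * 0.01)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (B, L, D)
        u = self.in_proj(x)
        gate = torch.sigmoid(self.gate_proj(x))
        h = torch.zeros(x.size(0), x.size(2), self.A.size(1), device=x.device)
        ys = []
        for t in range(x.size(1)):  # naive sequential scan, illustrative only
            h = h * torch.sigmoid(self.A) + u[:, t].unsqueeze(-1) * self.B
            ys.append((h * self.C).sum(-1))
        y = torch.stack(ys, dim=1)
        return self.out_proj(y * gate)  # gating modulates the SSM output

class FusionMambaBlockSketch(nn.Module):
    """Hypothetical FMB: SSCS-style channel swap, then DSSF-style fusion in a shared hidden space."""
    def __init__(self, dim):
        super().__init__()
        self.ssm_rgb = SimpleGatedSSM(dim)
        self.ssm_ir = SimpleGatedSSM(dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, f_rgb, f_ir):  # (B, C, H, W) feature maps from the two streams
        b, c, h, w = f_rgb.shape
        half = c // 2
        # SSCS (shallow fusion): exchange half of the channels between modalities.
        rgb_sw = torch.cat([f_rgb[:, :half], f_ir[:, half:]], dim=1)
        ir_sw = torch.cat([f_ir[:, :half], f_rgb[:, half:]], dim=1)
        # Flatten spatial positions into a sequence for the state-space layers.
        rgb_seq = rgb_sw.flatten(2).transpose(1, 2)  # (B, H*W, C)
        ir_seq = ir_sw.flatten(2).transpose(1, 2)
        # DSSF (deep fusion): interact in the hidden state space, then merge the two streams.
        rgb_h = self.ssm_rgb(rgb_seq)
        ir_h = self.ssm_ir(ir_seq)
        fused = self.fuse(torch.cat([rgb_h, ir_h], dim=-1))
        return fused.transpose(1, 2).reshape(b, c, h, w)

if __name__ == "__main__":
    fmb = FusionMambaBlockSketch(dim=64)
    rgb = torch.randn(2, 64, 20, 20)
    ir = torch.randn(2, 64, 20, 20)
    print(fmb(rgb, ir).shape)  # torch.Size([2, 64, 20, 20])

In a full pipeline, such a block would sit between corresponding stages of the two backbone streams (three times, per the paper), with the fused maps fed to a YOLOv8-style neck and head; the sequential scan here is a readability choice and would be replaced by an efficient selective-scan kernel in practice.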