RoboFusion: Towards Robust Multi-Modal 3D Object Detection via SAM

23 Apr 2024 | Ziying Song1, Guoxing Zhang2, Lin Liu1, Lei Yang3, Shaoqing Xu4, Caiyan Jia1*, Feiyang Jia1, Li Wang5
RoboFusion is a framework designed to enhance the robustness and generalization of multi-modal 3D object detectors in autonomous driving (AD) scenarios. It leverages visual foundation models (VFMs) such as SAM to handle out-of-distribution (OOD) noise. The framework comprises four key components:

1. **SAM-AD**: a version of SAM pre-trained on AD data, adapting it to the specific challenges of AD scenes.
2. **AD-FPN**: a feature-pyramid module that upsamples the image features extracted by SAM so they align with multi-modal 3D detection methods.
3. **Depth-Guided Wavelet Attention (DGWA)**: a module that applies wavelet decomposition to denoise depth-guided image features, suppressing noise while retaining essential signal components.
4. **Adaptive Fusion**: a module that uses self-attention to adaptively reweight the fused point-cloud and image features, enhancing informative features while suppressing excess noise.

RoboFusion is evaluated on both clean and noisy benchmarks, including KITTI-C and nuScenes-C, where it achieves state-of-the-art (SOTA) performance in noisy scenarios. The code is available at https://github.com/adept-thu/RoboFusion.
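To make the DGWA idea concrete, here is a minimal NumPy sketch of wavelet-based feature denoising: a single-level 2D Haar decomposition splits a feature map into a low-frequency band and three detail bands, the detail coefficients are soft-thresholded to suppress noise, and the map is reconstructed. This is an illustration of the underlying signal-processing idea under simplifying assumptions (function names are ours; the paper's actual DGWA module operates on learned, depth-guided features inside the network, not on raw maps):

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2D Haar decomposition of a feature map (even H and W assumed)."""
    a = (x[0::2, :] + x[1::2, :]) / 2.0   # row-wise low-pass
    d = (x[0::2, :] - x[1::2, :]) / 2.0   # row-wise high-pass
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0  # approximation band
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0  # horizontal detail
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0  # vertical detail
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0  # diagonal detail
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Exact inverse of haar_dwt2."""
    a = np.zeros((ll.shape[0], ll.shape[1] * 2))
    d = np.zeros_like(a)
    a[:, 0::2], a[:, 1::2] = ll + lh, ll - lh
    d[:, 0::2], d[:, 1::2] = hl + hh, hl - hh
    x = np.zeros((a.shape[0] * 2, a.shape[1]))
    x[0::2, :], x[1::2, :] = a + d, a - d
    return x

def soft_threshold(c, t):
    """Shrink coefficients toward zero; small (noisy) coefficients vanish."""
    return np.sign(c) * np.maximum(np.abs(c) - t, 0.0)

def wavelet_denoise(feat, t=0.1):
    """Denoise a 2D feature map by thresholding its Haar detail bands."""
    ll, lh, hl, hh = haar_dwt2(feat)
    return haar_idwt2(ll, soft_threshold(lh, t),
                      soft_threshold(hl, t), soft_threshold(hh, t))
```

With the threshold set to zero the transform is perfectly invertible, so thresholding only removes the low-magnitude detail coefficients where sensor noise tends to concentrate, which is the property DGWA exploits.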
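The Adaptive Fusion step can likewise be sketched as plain self-attention over concatenated LiDAR and image tokens, with the attention weights deciding how much each modality contributes at every location. This is a simplified, hypothetical sketch (single head, no learned layer norms or residuals; all parameter names are ours, not the paper's):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_fusion(lidar_feat, img_feat, wq, wk, wv):
    """Self-attention reweighting over concatenated modality tokens.

    lidar_feat, img_feat: (N, C) token features for the same N locations.
    wq, wk, wv: (C, C) projection matrices (learned in a real model).
    """
    tokens = np.concatenate([lidar_feat, img_feat], axis=0)       # (2N, C)
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))                # (2N, 2N)
    reweighted = attn @ v          # each token mixed by its relevance to all others
    n = lidar_feat.shape[0]
    return reweighted[:n] + reweighted[n:]                        # (N, C) fused output
```

Because the attention weights are data-dependent, tokens carrying informative signal receive larger weights while noisy tokens are down-weighted, which is the behavior the Adaptive Fusion module is designed to achieve.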