25 Mar 2024 | Zhiwei Lin, Zhe Liu, Zhongyu Xia, Xinhao Wang, Yongtao Wang, Shengxiang Qi, Yang Dong, Nan Dong, Le Zhang, Ce Zhu
RCBEVDet is a radar-camera fusion method for 3D object detection in bird's eye view (BEV). It combines radar and camera data to achieve more reliable 3D object detection. Its key components are RadarBEVNet, for efficient radar BEV feature extraction, and the Cross-Attention Multi-layer Fusion (CAMF) module, for robust radar-camera feature fusion. RadarBEVNet consists of a dual-stream radar backbone and an RCS-aware BEV encoder. The dual-stream radar backbone combines a point-based encoder and a transformer-based encoder, with an injection and extraction module that lets the two encoders exchange features. The RCS-aware BEV encoder uses radar cross-section (RCS) as an object size prior when scattering point features into BEV space. The CAMF module uses deformable cross-attention to dynamically align and fuse the radar and camera BEV features.

Experimental results show that RCBEVDet sets a new state of the art for radar-camera fusion on the nuScenes and View-of-Delft (VoD) 3D object detection benchmarks. It also outperforms all real-time camera-only and radar-camera 3D object detectors while running faster, at 21-28 FPS, and remains robust under sensor failure. Ablation studies confirm the effectiveness of each component. Overall, RCBEVDet significantly improves 3D object detection performance while maintaining real-time inference speed.
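To make the RCS-aware scattering idea concrete, here is a minimal, hypothetical PyTorch sketch (not the authors' implementation): each radar point's feature is spread over a BEV neighborhood whose size grows with its RCS value, which acts as a rough object-size prior. The function name `rcs_scatter`, the grid resolution, and the RCS-to-radius mapping are illustrative assumptions.

```python
import torch

def rcs_scatter(points_xy, point_feats, rcs, bev_size=128, cell=0.8, max_radius=2):
    """Hypothetical RCS-aware BEV scatter (illustrative sketch, not the paper's code).

    points_xy:   (N, 2) metric x/y coordinates of radar points
    point_feats: (N, C) per-point features from the radar backbone
    rcs:         (N,)   radar cross-section values, used as a size prior
    Returns a (C, bev_size, bev_size) BEV feature map.
    """
    C = point_feats.shape[1]
    bev = torch.zeros(C, bev_size, bev_size)

    # Map metric coordinates to BEV grid indices (ego vehicle at the grid center).
    ij = (points_xy / cell + bev_size / 2).long().clamp(0, bev_size - 1)

    # Larger RCS -> larger assumed object -> scatter over a larger neighborhood.
    rcs_pos = rcs.clamp(min=0)
    radius = (rcs_pos / rcs_pos.max().clamp(min=1e-6) * max_radius).long()

    for n in range(points_xy.shape[0]):
        ci, cj = int(ij[n, 0]), int(ij[n, 1])
        r = int(radius[n])
        i0, i1 = max(ci - r, 0), min(ci + r + 1, bev_size)
        j0, j1 = max(cj - r, 0), min(cj + r + 1, bev_size)
        # Accumulate the point feature into every cell of its RCS-sized neighborhood.
        bev[:, i0:i1, j0:j1] += point_feats[n].view(C, 1, 1)
    return bev
```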
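The cross-attention fusion step can likewise be sketched in a few lines. The snippet below is a simplified stand-in for CAMF: it uses PyTorch's standard multi-head cross-attention instead of the deformable cross-attention described in the paper, and the module and parameter names (`SimpleBEVFusion`, `embed_dim`, etc.) are illustrative assumptions. The idea it shows is that each BEV map queries the other to compensate for small misalignments before the features are fused channel-wise.

```python
import torch
import torch.nn as nn

class SimpleBEVFusion(nn.Module):
    """Illustrative radar-camera BEV fusion with plain cross-attention.

    Simplified stand-in for CAMF: the paper uses deformable cross-attention,
    which is replaced here by nn.MultiheadAttention for brevity.
    """
    def __init__(self, embed_dim=64, num_heads=4):
        super().__init__()
        self.cam_from_radar = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.radar_from_cam = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.fuse = nn.Conv2d(2 * embed_dim, embed_dim, kernel_size=1)

    def forward(self, cam_bev, radar_bev):
        # cam_bev, radar_bev: (B, C, H, W) BEV feature maps from the two modalities.
        B, C, H, W = cam_bev.shape
        cam_seq = cam_bev.flatten(2).transpose(1, 2)      # (B, H*W, C)
        radar_seq = radar_bev.flatten(2).transpose(1, 2)  # (B, H*W, C)

        # Each modality queries the other to pull in (mis)aligned features.
        cam_aligned, _ = self.cam_from_radar(cam_seq, radar_seq, radar_seq)
        radar_aligned, _ = self.radar_from_cam(radar_seq, cam_seq, cam_seq)

        cam_out = (cam_seq + cam_aligned).transpose(1, 2).reshape(B, C, H, W)
        radar_out = (radar_seq + radar_aligned).transpose(1, 2).reshape(B, C, H, W)

        # Channel-wise fusion of the aligned BEV features.
        return self.fuse(torch.cat([cam_out, radar_out], dim=1))

# Usage sketch on tiny random BEV maps:
fusion = SimpleBEVFusion(embed_dim=64, num_heads=4)
fused = fusion(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
print(fused.shape)  # torch.Size([1, 64, 32, 32])
```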