2 Jul 2024 | Ziying Song, Lei Yang, Shaoqing Xu, Lin Liu, Dongyang Xu, Caiyan Jia, Feiyang Jia, and Li Wang
The paper "GraphBEV: Towards Robust BEV Feature Alignment for Multi-Modal 3D Object Detection" addresses the challenge of feature misalignment in multi-modal 3D object detection, particularly in autonomous driving systems. The authors propose a robust fusion framework called GraphBEV to improve the alignment between LiDAR and camera BEV (Bird’s-Eye-View) features. The framework includes two main modules: LocalAlign and GlobalAlign.
1. **LocalAlign Module**: This module addresses local misalignment by enhancing the camera-to-BEV transformation with neighbor-aware depth features obtained through graph matching. It builds a KD-Tree over the projected LiDAR pixels to retrieve each pixel's neighbor indices, then fuses the projected depth with the neighboring depth features to improve depth estimation (see the first sketch after this list).
2. **GlobalAlign Module**: This module tackles global misalignment by encoding the LiDAR-to-camera projected depth together with the neighboring depth through dual depth encoding, producing a more reliable depth representation that incorporates neighbor information, and by aligning the global multi-modal BEV features with learnable offsets (see the second sketch after this list).
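To make the LocalAlign step concrete, below is a minimal sketch of the neighbor-aware depth retrieval it describes: a KD-Tree is built over the projected pixel coordinates, each point's k nearest projected neighbors are queried, and the neighboring depths are gathered alongside the projected depth. All names, shapes, and the simple concatenation-based fusion here are illustrative assumptions, not the authors' implementation.

```python
# Sketch of neighbor-aware depth retrieval via a KD-Tree (illustrative only).
import numpy as np
from scipy.spatial import cKDTree

def neighbor_depths(pixel_uv: np.ndarray, pixel_depth: np.ndarray, k: int = 8) -> np.ndarray:
    """pixel_uv: (N, 2) image coordinates of LiDAR points projected to the camera.
    pixel_depth: (N,) depth of each projected point.
    Returns (N, k) depths of each point's k nearest projected neighbors
    (the nearest neighbor of a point is the point itself)."""
    tree = cKDTree(pixel_uv)            # KD-Tree over projected pixel locations
    _, idx = tree.query(pixel_uv, k=k)  # indices of the k nearest projected pixels
    return pixel_depth[idx]             # gather neighboring depths, shape (N, k)

# Toy usage: fuse the projected depth with its neighbors (here, a simple concatenation)
uv = np.random.rand(1000, 2) * [1600, 900]   # fake projected pixel coordinates
d = np.random.rand(1000) * 60.0              # fake projected depths in meters
fused = np.concatenate([d[:, None], neighbor_depths(uv, d)], axis=1)  # (N, 1 + k)
```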
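For the GlobalAlign idea of aligning multi-modal BEV features with learnable offsets, here is a hedged sketch of one possible realization: a small convolution predicts a per-cell 2D offset from the concatenated LiDAR and camera BEV features, the camera BEV map is resampled at the offset locations, and the result is fused with the LiDAR BEV map. The module name, offset parameterization, and fusion layer are assumptions for illustration; the paper's actual design may differ.

```python
# Hedged sketch of offset-based BEV feature alignment (not the paper's exact module).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalAlignSketch(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Predict a per-cell 2D offset from the concatenated LiDAR + camera BEV features
        self.offset_head = nn.Conv2d(2 * channels, 2, kernel_size=3, padding=1)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, lidar_bev: torch.Tensor, cam_bev: torch.Tensor) -> torch.Tensor:
        b, _, h, w = cam_bev.shape
        offset = self.offset_head(torch.cat([lidar_bev, cam_bev], dim=1))  # (B, 2, H, W)
        # Base sampling grid in normalized [-1, 1] coordinates
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=cam_bev.device),
            torch.linspace(-1, 1, w, device=cam_bev.device),
            indexing="ij",
        )
        grid = torch.stack([xs, ys], dim=-1).expand(b, h, w, 2)
        # Shift the grid by the learned offsets (cell units -> normalized coordinates)
        grid = grid + offset.permute(0, 2, 3, 1) * torch.tensor(
            [2.0 / max(w - 1, 1), 2.0 / max(h - 1, 1)], device=cam_bev.device
        )
        cam_aligned = F.grid_sample(cam_bev, grid, align_corners=True)
        return self.fuse(torch.cat([lidar_bev, cam_aligned], dim=1))
```

Predicting the offsets jointly from both modalities lets the network learn to compensate for projection errors caused by imperfect calibration, which is the kind of global misalignment the module targets.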
The authors evaluate GraphBEV on the nuScenes dataset, demonstrating state-of-the-art performance with an mAP of 70.1%, outperforming the baseline BEVFusion by 1.6% on the validation set. Notably, GraphBEV also outperforms BEVFusion by 8.3% under noisy misalignment conditions. The framework is shown to be robust to various factors such as weather conditions, ego distances, and object sizes, further enhancing its practical applicability in real-world scenarios.