5 Jan 2024 | Ziying Song, Guoxin Zhang, Jun Xie, Lin Liu, Caiyan Jia, Shaoqing Xu, Zhepeng Wang
**VoxelNextFusion: A Simple, Unified and Effective Voxel Fusion Framework for Multi-Modal 3D Object Detection**
**Abstract:**
LiDAR-camera fusion enhances 3D object detection by leveraging complementary information from depth-aware LiDAR points and semantically rich images. Existing voxel-based methods struggle with one-to-one fusion of sparse voxel features and dense image features, leading to sub-optimal performance, especially at long distances. This paper introduces VoxelNextFusion, a multi-modal 3D object detection framework that effectively bridges the gap between sparse point clouds and dense images. It proposes a voxel-based image pipeline that projects point clouds onto images to obtain pixel- and patch-level features, which are then fused using self-attention. A feature importance module distinguishes between foreground and background features to minimize the impact of background features. Extensive experiments on KITTI and nuScenes datasets show that VoxelNextFusion improves AP@0.7 by around 3.20% for car detection in hard scenarios compared to the Voxel R-CNN baseline.
**Keywords:**
3D object detection, multi-modal fusion, patch fusion
**Introduction:**
3D object detection is crucial for autonomous driving, but single-modal methods have limitations. LiDAR captures sparse point clouds that lack context for distant or occluded objects, while cameras provide rich semantic information but no depth. Multi-modal methods combine these advantages. Current methods primarily use point cloud pipelines with image pipelines as supplements. Voxel-based methods convert point clouds into structured data for feature extraction, but the conventional one-to-one voxel-to-pixel mapping discards image semantics and local continuity. VoxelNextFusion addresses these issues with two modules: P²-Fusion combines one-to-many and one-to-one mappings to improve feature density and fusion accuracy, and FB-Fusion differentiates foreground from background features to better exploit the important ones.
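To make the one-to-one versus one-to-many distinction concrete, here is a minimal NumPy sketch (not the paper's code) contrasting a single-pixel lookup with a k×k patch lookup for projected LiDAR points; the calibration matrix, feature-map size, and function names are illustrative assumptions.

```python
# Minimal sketch: one-to-one pixel sampling vs. one-to-many patch sampling
# for LiDAR points projected onto an image feature map. The 3x4 projection
# matrix and feature-map shape below are toy values, not real calibration.
import numpy as np

def project_points(points_xyz, P):
    """Project Nx3 camera-frame points to pixel coordinates with a 3x4 matrix."""
    pts_h = np.concatenate([points_xyz, np.ones((len(points_xyz), 1))], axis=1)
    uvw = pts_h @ P.T                        # (N, 3) homogeneous image coords
    return uvw[:, :2] / uvw[:, 2:3]          # perspective divide -> (N, 2)

def sample_pixel(feat_map, uv):
    """One-to-one: each point keeps only the single pixel it lands on."""
    H, W, C = feat_map.shape
    u = np.clip(np.round(uv[:, 0]).astype(int), 0, W - 1)
    v = np.clip(np.round(uv[:, 1]).astype(int), 0, H - 1)
    return feat_map[v, u]                    # (N, C)

def sample_patch(feat_map, uv, k=3):
    """One-to-many: each point gathers a k x k neighborhood of pixels,
    preserving local image semantics and continuity around the projection."""
    H, W, C = feat_map.shape
    r = k // 2
    u = np.clip(np.round(uv[:, 0]).astype(int), r, W - 1 - r)
    v = np.clip(np.round(uv[:, 1]).astype(int), r, H - 1 - r)
    offsets = np.arange(-r, r + 1)
    patches = np.stack([feat_map[v + dv, u + du]
                        for dv in offsets for du in offsets], axis=1)
    return patches                           # (N, k*k, C)

# Toy usage with random data in place of real calibration and CNN features.
P = np.array([[700., 0., 620., 0.],
              [0., 700., 180., 0.],
              [0., 0., 1., 0.]])
points = np.array([[5.0, 1.0, 20.0], [-2.0, 0.5, 35.0]])
feat_map = np.random.rand(375, 1242, 64)     # H x W x C image features
uv = project_points(points, P)
print(sample_pixel(feat_map, uv).shape)      # (2, 64)
print(sample_patch(feat_map, uv, k=3).shape) # (2, 9, 64)
```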
**Related Work:**
Single-modal 3D object detection methods use either LiDAR or cameras, each with its limitations. Multi-modal methods combine data from different modalities to improve performance. Voxel-based methods convert point clouds into structured data for efficient feature extraction.
**VoxelNextFusion:**
- **P²-Fusion (Patch-Point Fusion):** Projects point clouds onto the image plane and fuses each non-empty voxel feature with both pixel- and patch-level image features via self-attention.
- **FB-Fusion (Foreground-Background Fusion):** Distinguishes foreground from background features so that the fused representation emphasizes the density and relevance of foreground features (a minimal sketch of both modules follows this list).
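A minimal PyTorch sketch of how the two modules could compose, under stated assumptions: the use of `nn.MultiheadAttention` for the patch-point attention, the sigmoid foreground scoring head, and all dimensions are illustrative, not the released implementation.

```python
# Illustrative sketch of a P²-Fusion-style attention step followed by an
# FB-Fusion-style foreground re-weighting. Shapes and layers are assumptions.
import torch
import torch.nn as nn

class PatchPointFusion(nn.Module):
    """Fuse each sparse voxel feature with the k*k image patch its points
    project onto, using the voxel feature as the attention query."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, voxel_feat, patch_feat):
        # voxel_feat: (N, C) non-empty voxel features
        # patch_feat: (N, k*k, C) image features gathered around projections
        q = voxel_feat.unsqueeze(1)                    # (N, 1, C) queries
        fused, _ = self.attn(q, patch_feat, patch_feat)
        return voxel_feat + fused.squeeze(1)           # residual fusion, (N, C)

class ForegroundBackgroundFusion(nn.Module):
    """Predict a per-voxel foreground score and re-weight the fused features
    so background responses are suppressed."""
    def __init__(self, dim=64):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, fused_feat):
        w = self.score(fused_feat)                     # (N, 1) in [0, 1]
        return fused_feat * w                          # emphasize foreground

# Toy usage with random tensors standing in for real voxel/image features.
N, C, K = 128, 64, 9
voxel_feat = torch.randn(N, C)
patch_feat = torch.randn(N, K, C)
p2 = PatchPointFusion(dim=C)
fb = ForegroundBackgroundFusion(dim=C)
out = fb(p2(voxel_feat, patch_feat))
print(out.shape)                                       # torch.Size([128, 64])
```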
**Experiments:**
- **KITTI Dataset:** VoxelNextFusion outperforms baselines on KITTI with improvements in AP for car, pedestrian, and cyclist categories.
- **nuScenes Dataset:** VoxelNextFusion achieves significant improvements over baselines on the nuScenes dataset, especially for small and long-range objects.
**Ablation Study:**
- **Effect of Sub-modules:** P²-Fusion and FB-Fusion are each ablated to assess their individual contributions to the overall detection performance.