Multi-View 3D Object Detection Network for Autonomous Driving

22 Jun 2017 | Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, Tian Xia
This paper proposes a multi-view 3D object detection network (MV3D) for autonomous driving that fuses LIDAR point clouds and RGB images to predict oriented 3D bounding boxes. The network consists of two subnetworks: a 3D proposal network and a region-based fusion network. The proposal network generates 3D candidate boxes from a bird's eye view representation of the point cloud, while the fusion network combines region-wise features from multiple views through a deep fusion scheme that enables interactions between intermediate layers of different views. Drop-path training and an auxiliary loss are used to regularize the fusion network and improve performance.

Experiments on the KITTI benchmark show that MV3D outperforms state-of-the-art methods by around 25% AP in 3D localization and 30% AP in 3D detection. For 2D detection, it achieves 10.3% higher AP than other LIDAR-based methods on the hard data. For 3D detection with IoU=0.5, it reaches 87.65% AP on the moderate setting and 89.05% AP on the hard setting, and the 2D detections derived from its 3D detections are competitive with state-of-the-art 2D detection methods. The network is trained end-to-end and runs in around 0.36 seconds per image on a GeForce Titan X GPU, demonstrating the effectiveness of multi-view fusion for 3D object detection in autonomous driving scenarios.
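
The summary states that proposals are generated from a bird's eye view representation of the point cloud. Below is a minimal NumPy sketch of such an encoding, assuming per-slice maximum-height maps plus density and intensity maps; the function name lidar_to_bev, the crop ranges, the 0.1 m resolution, and the number of slices are illustrative assumptions rather than the authors' released configuration.

```python
import numpy as np

def lidar_to_bev(points, x_range=(0.0, 70.0), y_range=(-40.0, 40.0),
                 z_range=(-2.0, 1.0), res=0.1, n_slices=5):
    """Encode a LIDAR point cloud (N x 4 array of x, y, z, reflectance) as
    bird's eye view maps: one max-height map per vertical slice, plus a
    density map and an intensity map. Ranges, resolution, and slice count
    are illustrative defaults, not necessarily the paper's settings."""
    nx = int(round((x_range[1] - x_range[0]) / res))
    ny = int(round((y_range[1] - y_range[0]) / res))
    height = np.zeros((n_slices, nx, ny), dtype=np.float32)
    density = np.zeros((nx, ny), dtype=np.float32)
    intensity = np.zeros((nx, ny), dtype=np.float32)
    top_h = np.full((nx, ny), -np.inf, dtype=np.float32)

    # Keep only points inside the cropped 3D region of interest.
    m = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
         (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]) &
         (points[:, 2] >= z_range[0]) & (points[:, 2] < z_range[1]))
    pts = points[m]

    # Map each point to a grid cell and a vertical slice.
    xi = np.clip(((pts[:, 0] - x_range[0]) / res).astype(int), 0, nx - 1)
    yi = np.clip(((pts[:, 1] - y_range[0]) / res).astype(int), 0, ny - 1)
    zi = np.clip(((pts[:, 2] - z_range[0]) / (z_range[1] - z_range[0])
                  * n_slices).astype(int), 0, n_slices - 1)
    rel_h = pts[:, 2] - z_range[0]

    for s, x, y, h, r in zip(zi, xi, yi, rel_h, pts[:, 3]):
        height[s, x, y] = max(height[s, x, y], h)  # max height within the slice
        if h > top_h[x, y]:                        # track the highest point per cell
            top_h[x, y] = h
            intensity[x, y] = r                    # reflectance of the highest point
        density[x, y] += 1.0                       # raw point count per cell

    # Normalized point density, min(1, log(N + 1) / log(64)).
    density = np.minimum(1.0, np.log(density + 1.0) / np.log(64.0))
    return height, density, intensity
```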
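
The deep fusion of region-wise features from the bird's eye view, front view, and image branches, together with drop-path regularization, could be sketched as below. Using an element-wise mean as the join operation follows the paper's description, but the function name deep_fusion_step, the fully connected/ReLU stand-ins, and the simplified way drop-path is simulated are assumptions for illustration.

```python
import numpy as np

def deep_fusion_step(view_features, view_weights, keep_paths=None):
    """One deep-fusion stage (hedged sketch). Per-view ROI features that
    survive drop-path are joined by an element-wise mean, then each view
    applies its own transformation, so intermediate layers of different
    views can interact rather than being fused only once at the end."""
    if keep_paths is None:
        keep_paths = [True] * len(view_features)
    kept = [f for f, k in zip(view_features, keep_paths) if k]
    fused = np.mean(np.stack(kept, axis=0), axis=0)      # join operation
    return [np.maximum(fused @ w, 0.0) for w in view_weights]  # FC + ReLU stand-in

# Example: three views (bird's eye view, front view, image) with 256-d ROI
# features; the front-view path is dropped for this training iteration.
rng = np.random.default_rng(0)
feats = [rng.standard_normal(256) for _ in range(3)]
weights = [rng.standard_normal((256, 256)) * 0.01 for _ in range(3)]
outputs = deep_fusion_step(feats, weights, keep_paths=[True, False, True])
```

Stacking several such stages approximates the paper's idea of fusing multi-view features hierarchically instead of concatenating them a single time.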