9 Apr 2021 | Shaoshuai Shi, Chaoxu Guo, Li Jiang, Zhe Wang, Jianping Shi, Xiaogang Wang, Hongsheng Li
PV-RCNN is a high-performance 3D object detection framework that integrates 3D voxel convolutional neural networks (CNNs) with PointNet-based set abstraction to learn more discriminative point cloud features. A voxel-to-keypoint scene encoding strategy summarizes the 3D scene into a small set of keypoints, reducing computational cost while preserving representative scene features. A keypoint-to-grid RoI feature abstraction then extracts proposal-specific features from these keypoints, enabling accurate estimation of object confidences and locations. The RoI-grid pooling module applies keypoint set abstraction with multiple receptive fields, encoding richer context information for improved detection performance.

Extensive experiments on the KITTI and Waymo Open datasets show that PV-RCNN outperforms state-of-the-art methods by significant margins, ranking first on the KITTI 3D detection benchmark. By combining the strengths of voxel-based and point-based feature learning, the framework improves 3D object detection performance while keeping memory consumption manageable. Key contributions include the voxel-to-keypoint scene encoding scheme, the multi-scale RoI feature abstraction layer, and the integration of voxel-based and point-based networks for discriminative feature learning. The method performs strongly in challenging scenarios and shows robustness across different datasets.
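Both the voxel-to-keypoint encoding and the RoI-grid pooling described above rely on the same core operation: PointNet-style set abstraction, where each query point (a keypoint, or an RoI grid point) gathers neighboring points within a radius, transforms their relative coordinates and features with a shared MLP, and max-pools the result. The minimal numpy sketch below illustrates this under simplifying assumptions: a single query radius and a one-layer "MLP" with a toy weight matrix `w`; the actual PV-RCNN layers use learned multi-layer MLPs with multiple radii per query point.

```python
import numpy as np

def set_abstraction(query_pts, support_pts, support_feats, radius, w):
    """Sketch of set abstraction: ball query + shared MLP + max pool.

    query_pts:     (N, 3) points to aggregate features at
    support_pts:   (M, 3) source point coordinates
    support_feats: (M, C) source point features
    w:             (3 + C, D) toy shared-MLP weights (assumption; real
                   implementations use learned multi-layer MLPs)
    """
    out = np.zeros((len(query_pts), w.shape[1]))
    for i, q in enumerate(query_pts):
        # ball query: neighbors within `radius` of the query point
        d = np.linalg.norm(support_pts - q, axis=1)
        idx = np.where(d <= radius)[0]
        if len(idx) == 0:
            continue  # empty ball -> zero feature vector
        # concatenate relative coordinates with the point features
        rel = support_pts[idx] - q
        x = np.concatenate([rel, support_feats[idx]], axis=1)
        # shared "MLP" (single linear + ReLU here), then max pool
        out[i] = np.maximum(x @ w, 0.0).max(axis=0)
    return out

# Tiny deterministic demo: two query points, one with neighbors, one without.
support_pts = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [5.0, 5.0, 5.0]])
support_feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
keypoints = np.array([[0.0, 0.0, 0.0], [10.0, 10.0, 10.0]])
w = np.ones((5, 4))
pooled = set_abstraction(keypoints, support_pts, support_feats, radius=0.5, w=w)
print(pooled.shape)  # (2, 4)
```

In PV-RCNN this operation is applied twice: once with voxel features as the support set to encode the scene into keypoints, and again with keypoint features as the support set to pool features at each RoI's grid points. Running it with several radii and concatenating the results gives the multi-scale, multi-receptive-field behavior the summary refers to.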