10 Apr 2017 | Arsalan Mousavian, Dragomir Anguelov, John Flynn, Jana Košecká
The paper presents a method for 3D object detection and pose estimation from a single image, focusing on estimating the 3D orientation and dimensions of an object using deep learning. Unlike existing techniques that only regress the 3D orientation, this method first regresses relatively stable 3D object properties using a deep convolutional neural network (CNN) and then combines these estimates with geometric constraints provided by a 2D object bounding box to produce a complete 3D bounding box. The first network output estimates the 3D object orientation using a novel hybrid discrete-continuous loss, which significantly outperforms the L2 loss. The second output regresses the 3D object dimensions, which have relatively little variance and can often be predicted reliably for many object types. These estimates, combined with the geometric constraints on translation imposed by the 2D bounding box, enable the recovery of a stable and accurate 3D object pose. The method is evaluated on the KITTI object detection benchmark, both on the official metric of 3D orientation estimation and on the accuracy of the obtained 3D bounding boxes. Despite being conceptually simple, the method outperforms more complex and computationally expensive approaches that leverage semantic segmentation, instance-level segmentation, and flat ground priors. The discrete-continuous loss also produces state-of-the-art results for 3D viewpoint estimation on the Pascal 3D+ dataset. The main contributions include a method to estimate the full 3D pose and dimensions from a 2D bounding box, a novel *MultiBin* discrete-continuous CNN architecture for orientation estimation, three new metrics for evaluating 3D boxes, and experimental evaluations demonstrating the effectiveness of the approach.
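To make the hybrid discrete-continuous idea concrete, here is a minimal sketch (not the authors' code) of the MultiBin decoding step: orientation is discretized into bins, the network predicts a confidence per bin plus a (cos, sin) residual relative to each bin's center, and the final angle is recovered from the most confident bin. The function name and the assumption of evenly spaced bin centers are illustrative.

```python
import math

def decode_multibin(confidences, residuals, num_bins):
    """Recover an orientation angle from MultiBin-style outputs.

    confidences: per-bin scores (higher = more confident)
    residuals:   per-bin (cos, sin) pairs, the regressed angular
                 offset from that bin's center
    """
    # Assume bin centers evenly partition the full circle.
    bin_size = 2 * math.pi / num_bins
    # Pick the bin the network is most confident about.
    best = max(range(num_bins), key=lambda i: confidences[i])
    cos_r, sin_r = residuals[best]
    # Convert the regressed (cos, sin) pair back to an angle offset;
    # atan2 handles all quadrants correctly.
    offset = math.atan2(sin_r, cos_r)
    # Final angle = chosen bin's center plus its residual,
    # normalized to [0, 2*pi).
    return (best * bin_size + offset) % (2 * math.pi)
```

Regressing (cos, sin) instead of the raw angle avoids the wrap-around discontinuity at 0/2π that makes a plain L2 loss on angles unstable.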