10 Apr 2017 | Arsalan Mousavian, Dragomir Anguelov, John Flynn, Jana Košecká
This paper presents a method for 3D object detection and pose estimation from a single image. Unlike existing methods that only estimate the 3D orientation of an object, this approach first estimates stable 3D object properties using a deep convolutional neural network (CNN) and then combines these estimates with geometric constraints from the 2D bounding box to produce a complete 3D bounding box. The first network output estimates the 3D orientation using a novel hybrid discrete-continuous loss, which outperforms the L2 loss. The second output regresses the 3D object dimensions, which have relatively low variance and can be predicted for many object types. Together with the geometric constraints provided by the 2D bounding box, these estimates enable the recovery of a stable and accurate 3D object pose.

The method is evaluated on the KITTI and Pascal 3D+ datasets, where it outperforms more complex approaches that rely on semantic segmentation, instance-level segmentation, and flat-ground priors. The discrete-continuous loss also achieves state-of-the-art results for 3D viewpoint estimation on the Pascal 3D+ dataset.

The main contributions are: a method to estimate a 3D object's full pose and dimensions from a 2D bounding box using projective geometry together with CNN-regressed orientation and size estimates; a novel discrete-continuous CNN architecture, MultiBin, for orientation estimation; three new metrics for evaluating 3D boxes on the KITTI dataset; and an experimental evaluation demonstrating the effectiveness of the approach on KITTI cars. The paper also discusses the choice of suitable regression parameters within the estimation framework and evaluates the robustness of the method under different conditions.
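As a sketch of the MultiBin idea, the orientation head can be decoded as follows. This is a hypothetical layout, assuming equally spaced non-overlapping bins, per-bin softmax confidences, and a (cos, sin) residual relative to each bin center; the paper's exact bin placement (it uses overlapping bins) and head layout may differ:

```python
import numpy as np

def decode_multibin(bin_conf, bin_residual, num_bins):
    """Decode a MultiBin orientation head into a single angle.

    bin_conf:     (num_bins,) softmax confidence per angle bin
    bin_residual: (num_bins, 2) (cos, sin) of the residual angle
                  measured from each bin's center
    """
    bin_width = 2.0 * np.pi / num_bins
    centers = np.arange(num_bins) * bin_width   # bin centers in [0, 2*pi)
    best = int(np.argmax(bin_conf))             # most confident bin wins
    cos_r, sin_r = bin_residual[best]
    residual = np.arctan2(sin_r, cos_r)         # residual inside that bin
    return (centers[best] + residual) % (2.0 * np.pi)
```

Training would pair a cross-entropy loss on the bin confidences with a cosine-distance loss on the residual of the ground-truth bin, which is the discrete-continuous split the loss name refers to.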
The results show that the method outperforms existing approaches in terms of 3D bounding box accuracy and orientation estimation.
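The geometric step, recovering the object's translation from the 2D box once orientation and dimensions are known, can be sketched as a small least-squares problem. This sketch assumes camera intrinsics K, the CNN-estimated rotation R, and a fixed correspondence between each side of the 2D box and the 3D box corner whose projection touches it (the paper enumerates candidate correspondences rather than fixing one); `solve_translation` and its argument layout are illustrative, not the paper's API:

```python
import numpy as np

def solve_translation(K, R, box2d, corners_obj):
    """Least-squares translation T from 2D-box / 3D-corner constraints.

    K:           3x3 camera intrinsics
    R:           3x3 object rotation (from the orientation head)
    box2d:       (x_min, y_min, x_max, y_max) detected 2D box
    corners_obj: four 3D corners in the object frame, one per side,
                 assumed to project onto x_min, y_min, x_max, y_max
    """
    x_min, y_min, x_max, y_max = box2d
    # A side constraint such as u = x_min for corner c linearizes to
    # (K[0] - x_min * K[2]) . (R c + T) = 0 after cross-multiplying
    # the perspective division; each side pairs an image axis with a value.
    sides = [(0, x_min), (1, y_min), (0, x_max), (1, y_max)]
    A, b = [], []
    for (axis, val), c in zip(sides, corners_obj):
        coeff = K[axis] - val * K[2]       # 1x3 row acting on (R c + T)
        A.append(coeff)
        b.append(-coeff @ (R @ c))
    # Four equations in the three unknowns of T -> least squares
    T, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return T
```

In the full method, all plausible corner-to-side assignments are enumerated and the candidate 3D box with the smallest reprojection error is kept.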