SSD-6D: Making RGB-Based 3D Detection and 6D Pose Estimation Great Again

27 Nov 2017 | Wadim Kehl, Fabian Manhardt, Federico Tombari, Slobodan Ilic, Nassir Navab
SSD-6D is a method for detecting 3D model instances and estimating their 6D poses from RGB data in a single shot. It extends the popular SSD detection paradigm to cover the full 6D pose space and is trained on synthetic model data only. On multiple challenging datasets it competes with or surpasses state-of-the-art methods that rely on RGB-D data, while running at around 10 Hz, considerably faster than related approaches. The trained networks and detection code are publicly available for reproducibility.

The method takes a single RGB image as input and produces 2D detections with bounding boxes, together with a pool of the most likely 6D poses for each instance. Each pose is inferred from classification scores over discretized viewpoints and in-plane rotations, and the resulting hypotheses are subsequently refined and verified.

The network architecture builds on a pre-trained InceptionV4 backbone that yields feature maps at multiple scales. Each feature map is convolved with prediction kernels that output, per anchor box, the object class, a 2D bounding box regression, and scores over the possible viewpoints and in-plane rotations, as sketched below.
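To make the output layout concrete, here is a minimal PyTorch sketch of such a per-scale prediction head. It is not the authors' code: the 3x3 kernel size follows SSD convention, and the anchor, class, viewpoint, and in-plane bin counts are illustrative assumptions rather than values from the paper.

```python
import torch
import torch.nn as nn

class SSD6DHead(nn.Module):
    """One SSD-6D-style prediction kernel for a single feature map:
    per anchor box it emits class scores, a 4-value box regression,
    viewpoint scores, and in-plane rotation scores."""

    def __init__(self, in_channels, num_anchors=6,
                 num_classes=16, num_views=300, num_inplane=18):
        super().__init__()
        # One 3x3 convolution per prediction type, SSD-style.
        self.cls  = nn.Conv2d(in_channels, num_anchors * num_classes, 3, padding=1)
        self.box  = nn.Conv2d(in_channels, num_anchors * 4, 3, padding=1)
        self.view = nn.Conv2d(in_channels, num_anchors * num_views, 3, padding=1)
        self.rot  = nn.Conv2d(in_channels, num_anchors * num_inplane, 3, padding=1)

    def forward(self, fmap):
        # fmap: (B, C, H, W) feature map from the backbone (e.g. InceptionV4).
        # Each output carries one score set per anchor per spatial cell.
        return (self.cls(fmap), self.box(fmap),
                self.view(fmap), self.rot(fmap))

# Usage: apply the head to one hypothetical 38x38 feature map.
head = SSD6DHead(in_channels=384)
cls_scores, box_deltas, view_scores, rot_scores = head(torch.randn(1, 384, 38, 38))
```

In the full detector one such head is attached to each feature-map scale, so coarse maps handle large objects and fine maps handle small ones.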
Training uses synthetic renderings of the 3D models only and decomposes the model pose space into discrete viewpoints and in-plane rotations, which also allows object symmetries to be handled by collapsing equivalent views. In this way SSD is extended to produce 2D detections and to infer proper 6D poses from them (see the sketch after this paragraph). Color information alone already yields close to perfect detection rates with good poses; depth, when available, is treated as an optional modality for hypothesis verification and pose refinement.

The method is evaluated on multiple datasets, including Tejani, LineMOD, and a multi-object dataset. It outperforms RGB-D methods on the Tejani dataset by 13.8%, and its average pose errors on LineMOD are competitive with other methods. At approximately 85 ms per object it is faster than related work, and it scales to multiple objects within a single network.

The approach is robust to occlusions and copes with objects of poor geometry or texture. However, it struggles with small objects and requires a certain color similarity between the synthetic renderings and the scene appearance; making the method robust to color deviation between CAD models and real scenes is left for future work.
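The step from discrete scores to a metric pose can be illustrated as follows. This is a hedged sketch, not the paper's implementation: the helper names and arguments are hypothetical, and the depth recovery shown here uses a simple box-scale ratio against the synthetic rendering, one common way to lift a 2D detection to 6D under a pinhole camera model.

```python
import numpy as np

def rot_z(theta):
    """In-plane rotation about the camera's optical axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def pose_from_scores(view_rotation, inplane_angle, bbox, render_bbox,
                     render_distance, fx, fy, cx, cy):
    # view_rotation: 3x3 rotation of the best-scoring sampled viewpoint.
    # Full rotation = in-plane spin applied on top of the viewpoint rotation.
    R = rot_z(inplane_angle) @ view_rotation

    # Depth from relative box scale: the synthetic view was rendered at a
    # known distance; a detected box half the rendered size puts the object
    # roughly twice as far away under the pinhole model.
    x1, y1, x2, y2 = bbox
    rx1, ry1, rx2, ry2 = render_bbox
    scale = ((rx2 - rx1) * (ry2 - ry1) / ((x2 - x1) * (y2 - y1))) ** 0.5
    tz = render_distance * scale

    # Back-project the detected box center to get the lateral offsets.
    u, v = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    t = np.array([(u - cx) * tz / fx, (v - cy) * tz / fy, tz])
    return R, t
```

Building one such hypothesis per high-scoring (viewpoint, in-plane) pair yields the pose pool that the method then refines and verifies.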