13 Mar 2017 | Alex Kendall, Hayk Martirosyan, Saumitro Dasgupta, Peter Henry, Ryan Kennedy, Abraham Bachrach, Adam Bry
This paper presents a novel end-to-end deep learning architecture for stereo disparity regression. The proposed method, called GC-Net (Geometry and Context Network), leverages geometric knowledge to form a cost volume using deep feature representations. It incorporates contextual information using 3D convolutions over this volume. Disparity values are regressed from the cost volume using a differentiable soft argmin operation, enabling end-to-end training to sub-pixel accuracy without additional post-processing or regularization. The method is evaluated on the Scene Flow and KITTI datasets, achieving a new state-of-the-art benchmark on KITTI while being significantly faster than competing approaches.
The paper introduces a stereo regression model that can be trained end-to-end to understand wider contextual information. Traditional stereo algorithms often struggle with textureless areas, reflective surfaces, thin structures, and repetitive patterns. Deep learning models have shown success in learning powerful representations directly from raw data in tasks like object classification, detection, and semantic segmentation. The proposed method uses a deep convolutional network to learn semantic context, which helps in understanding the geometry of the scene.
GC-Net is designed to reason explicitly about geometry, by forming a cost volume, and about semantics, through a deep convolutional network formulation. Its two key ideas are learning to incorporate context directly from the data using 3D convolutions over the cost volume, and using a soft argmin function to regress sub-pixel disparity values from that volume.
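The two operations named above can be illustrated with a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names `build_cost_volume` and `soft_argmin` are hypothetical, the volume is assumed to be built by concatenating left features with right features shifted by each candidate disparity, and a hand-written squared feature distance stands in for the learned 3D convolutions that would normally reduce the volume to one cost per disparity.

```python
import numpy as np


def build_cost_volume(left_feat, right_feat, max_disp):
    """Pair left features with right features shifted by each disparity.

    left_feat, right_feat: arrays of shape (H, W, F), with max_disp < W.
    Returns an array of shape (max_disp + 1, H, W, 2 * F).
    """
    h, w, f = left_feat.shape
    volume = np.zeros((max_disp + 1, h, w, 2 * f), dtype=left_feat.dtype)
    for d in range(max_disp + 1):
        # Left column x is paired with right column x - d.
        # Columns x < d have no counterpart and stay zero-filled.
        volume[d, :, d:, :f] = left_feat[:, d:, :]
        volume[d, :, d:, f:] = right_feat[:, : w - d, :]
    return volume


def soft_argmin(cost, axis=0):
    """Expected disparity under softmax(-cost) taken along `axis`."""
    neg = -cost
    neg = neg - neg.max(axis=axis, keepdims=True)  # numerical stability
    prob = np.exp(neg)
    prob = prob / prob.sum(axis=axis, keepdims=True)
    shape = [1] * cost.ndim
    shape[axis] = cost.shape[axis]
    disparities = np.arange(cost.shape[axis], dtype=float).reshape(shape)
    return (prob * disparities).sum(axis=axis)


# Toy data: the left features are the right features shifted by 3 columns.
rng = np.random.default_rng(0)
right = rng.normal(size=(4, 16, 8))
true_disp = 3
left = np.roll(right, true_disp, axis=1)

max_disp = 5
volume = build_cost_volume(left, right, max_disp)
f = left.shape[-1]

# Stand-in for the learned 3D convolutions (an assumption of this sketch):
# squared distance between the paired left and right features.
cost = ((volume[..., :f] - volume[..., f:]) ** 2).sum(axis=-1)

# Scaling the cost sharpens the softmax; a trained network would learn this.
disp = soft_argmin(10.0 * cost, axis=0)  # shape (4, 16)
```

In this toy setup the estimated disparity is close to 3 for columns at or beyond `max_disp`, where every candidate disparity has a valid match; columns nearer the left border are affected by the zero padding. Because the output is a probability-weighted average rather than a hard argmin, it is differentiable and can take non-integer values, which is what permits sub-pixel regression.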
The method is evaluated on the synthetic Scene Flow dataset and the KITTI 2012 and 2015 datasets, achieving state-of-the-art results. The model is shown to be able to learn semantic reasoning and contextual information, which improves performance in challenging scenarios such as reflective, thin, or textureless surfaces. The model is also faster than many competing approaches and does not require post-processing or regularization. The paper concludes that the proposed method effectively regresses stereo disparity using a combination of geometric and contextual information, achieving sub-pixel accuracy and outperforming classification approaches.