12 Apr 2017 | Clément Godard, Oisin Mac Aodha, Gabriel J. Brostow
This paper presents an unsupervised method for monocular depth estimation using left-right consistency. The authors propose a novel training objective that enables a convolutional neural network to learn depth estimation without explicit ground truth depth data. By exploiting epipolar geometry constraints, the network is trained to generate disparity images using an image reconstruction loss. To improve the quality of depth images, a new training loss is introduced that enforces consistency between disparities produced relative to both the left and right images. The method outperforms existing supervised methods on the KITTI driving dataset and generalizes to other datasets, including a new outdoor urban dataset collected by the authors. The network architecture is inspired by the DispNet architecture but includes modifications to enable unsupervised training.
The paper also discusses the limitations of the method, such as the need for rectified and temporally aligned stereo pairs during training and the inability to handle specular and transparent surfaces.
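
The two loss terms described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes single-channel, rectified images, a plain L1 photometric error as the reconstruction loss, and the sign convention that a left-image pixel at column x appears in the right image at column x - d. The function names (`warp_horizontal`, `reconstruction_loss`, `lr_consistency_loss`) are hypothetical and chosen only for this illustration.

```python
import numpy as np


def warp_horizontal(src, shift):
    """Sample src[y, x + shift[y, x]] using linear interpolation along x.

    Sampling positions that fall outside the image are clamped to the border.
    """
    h, w = src.shape
    xs = np.arange(w, dtype=np.float64)[None, :] + shift
    xs = np.clip(xs, 0.0, w - 1.0)
    x0 = np.floor(xs).astype(int)      # left neighbour column
    x1 = np.minimum(x0 + 1, w - 1)     # right neighbour column
    frac = xs - x0                     # interpolation weight in [0, 1)
    rows = np.arange(h)[:, None]
    return (1.0 - frac) * src[rows, x0] + frac * src[rows, x1]


def reconstruction_loss(left, right, disp_left):
    """Mean L1 error between the left image and its reconstruction.

    Assumption: the left view is rebuilt by sampling the right image
    at column x - disp_left.
    """
    left_hat = warp_horizontal(right, -disp_left)
    return float(np.mean(np.abs(left - left_hat)))


def lr_consistency_loss(disp_left, disp_right):
    """Mean L1 difference between the left disparity map and the right
    disparity map projected into the left view (same sign assumption)."""
    disp_right_in_left = warp_horizontal(disp_right, -disp_left)
    return float(np.mean(np.abs(disp_left - disp_right_in_left)))


if __name__ == "__main__":
    # Toy stereo pair: the right image is the left image shifted by 2 pixels.
    h, w, d = 4, 16, 2.0
    left = np.tile(np.arange(w, dtype=np.float64) ** 2, (h, 1))
    right = left[:, np.minimum(np.arange(w) + int(d), w - 1)]
    disp = np.full((h, w), d)
    print("reconstruction loss:", reconstruction_loss(left, right, disp))
    print("consistency loss:", lr_consistency_loss(disp, disp))
```

In a training setup these quantities would be computed with a differentiable sampler inside a deep learning framework and combined with weights; the NumPy version here only shows what each term measures.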