Unsupervised Monocular Depth Estimation with Left-Right Consistency


12 Apr 2017 | Clément Godard, Oisin Mac Aodha, Gabriel J. Brostow
This paper presents an unsupervised deep neural network for monocular depth estimation. Unlike existing methods that rely on ground-truth depth data, the proposed method trains on binocular stereo footage. The key innovation is a novel training objective that enforces consistency between the left and right disparity maps, leading to improved depth estimates. Depth is learned as the intermediate of an image reconstruction task: the network synthesizes one view of a rectified stereo pair from the other, so no depth labels are required. The model uses a fully convolutional architecture inspired by DispNet and incorporates a left-right consistency loss to enhance performance.

The method achieves state-of-the-art results on the KITTI driving dataset, outperforming supervised methods, and is shown to generalize well to other datasets, including a new outdoor urban dataset. Inference is fast: depth prediction for a 512×256 image takes around 35 milliseconds on a modern GPU. Evaluations on multiple datasets show superior accuracy and robustness compared to existing approaches, producing visually plausible depth maps even in challenging scenarios. The paper also discusses limitations, including artifacts at occlusion boundaries and issues with specular surfaces. Future work includes extending the model to video and exploring alternative training signals.
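The left-right consistency idea lends itself to a compact implementation. Below is a minimal PyTorch sketch (not the authors' code) of the consistency term: each predicted disparity map is warped into the other view using the opposite map, and the L1 difference between the prediction and its warped counterpart is penalized. It assumes a rectified stereo pair with disparities expressed as a fraction of image width; `warp_horizontal` and `left_right_consistency_loss` are hypothetical names, and the sign convention depends on the rectification setup.

```python
import torch
import torch.nn.functional as F

def warp_horizontal(img, disp):
    """Resample `img` along the x-axis by a per-pixel disparity.

    img:  (B, C, H, W) tensor to warp.
    disp: (B, 1, H, W) signed disparity, expressed as a fraction
          of image width (an assumed convention).
    """
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1.0, 1.0, h, device=img.device),
        torch.linspace(-1.0, 1.0, w, device=img.device),
        indexing="ij",
    )
    # grid_sample expects coordinates in [-1, 1], so a shift of one
    # full image width corresponds to 2.0 in grid units.
    xs = xs.unsqueeze(0) + 2.0 * disp.squeeze(1)
    grid = torch.stack((xs, ys.unsqueeze(0).expand_as(xs)), dim=-1)
    return F.grid_sample(img, grid, align_corners=True)

def left_right_consistency_loss(disp_left, disp_right):
    """L1 penalty on disagreement between the two disparity maps.

    For a rectified pair with non-negative disparities, the left map
    at pixel x should match the right map sampled at x - d_left(x),
    and symmetrically for the right map.
    """
    right_as_left = warp_horizontal(disp_right, -disp_left)
    left_as_right = warp_horizontal(disp_left, disp_right)
    return ((disp_left - right_as_left).abs().mean()
            + (disp_right - left_as_right).abs().mean())
```

In the paper, this consistency term is one of three components: it is combined with an appearance-matching (image reconstruction) loss and a disparity smoothness loss, summed over multiple output scales.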