Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue (2016) | Ravi Garg, Vijay Kumar B.G., Gustavo Carneiro, and Ian Reid
This paper proposes an unsupervised convolutional neural network (CNN) for single-view depth estimation, which does not require pre-training or annotated ground-truth depths. The network is trained using a method analogous to an autoencoder, where a pair of images with known camera motion is used to generate an inverse warp of the target image based on predicted depth. The photometric error in the reconstruction is used as the loss function for the encoder. This approach avoids the need for manual annotation or calibration of depth sensors, and is shown to perform comparably to state-of-the-art supervised methods on the KITTI dataset.
The proposed method trains the CNN on rectified stereo pairs: the encoder predicts a depth map for the source (left) image, and the decoder reconstructs that image by backward-warping the other view with the predicted disparities, which forces the encoder's output to behave as valid disparities. The reconstruction loss is minimized together with a smoothness prior on the disparities, which resolves the aperture problem in textureless regions. The network is trained end-to-end on this combination of reconstruction loss and smoothness regularization.
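The training signal described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: it assumes grayscale images, a simple 1-D linear interpolation in place of the paper's bilinear sampler, and a hypothetical smoothness weight `gamma`.

```python
import numpy as np

def inverse_warp(right, disparity):
    """Reconstruct the left image by sampling the right image at
    x - d(x, y) along each row, with linear interpolation (a simplified
    stand-in for the paper's differentiable bilinear warp)."""
    h, w = right.shape
    xs = np.arange(w)[None, :] - disparity          # source x-coords, (h, w)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 2)
    frac = np.clip(xs - x0, 0.0, 1.0)
    rows = np.arange(h)[:, None]
    return (1.0 - frac) * right[rows, x0] + frac * right[rows, x0 + 1]

def training_loss(left, right, disparity, gamma=0.01):
    """Photometric reconstruction error plus a disparity-smoothness
    prior; `gamma` is an assumed regularization weight."""
    recon = inverse_warp(right, disparity)
    photometric = np.mean((left - recon) ** 2)
    # Smoothness prior: penalize horizontal and vertical disparity gradients.
    dx = np.diff(disparity, axis=1)
    dy = np.diff(disparity, axis=0)
    smoothness = np.mean(dx ** 2) + np.mean(dy ** 2)
    return photometric + gamma * smoothness
```

In the actual network the disparity map is the CNN's output and the loss is backpropagated through the warp; here the warp and loss are just evaluated directly to show the geometry of the supervision signal.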
The approach is evaluated on the KITTI dataset, where a network trained on less than half of the dataset achieves performance comparable to supervised methods across the standard depth-estimation metrics. The results demonstrate that the unsupervised CNN can reach state-of-the-art single-view depth estimation without labeled data or manual annotation, and that it generalizes well to unseen real-world data. The paper concludes that the approach is a promising alternative to supervised methods for single-view depth estimation, while noting that further research is needed to improve the network's performance.
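For reference, the depth-estimation metrics conventionally reported on KITTI can be computed as below. This is a generic sketch of the standard error and accuracy measures (absolute relative error, RMSE, and threshold accuracy), assuming `pred` and `gt` are arrays of strictly positive depths at valid ground-truth pixels; the function name and return layout are illustrative, not from the paper.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard single-view depth metrics: error measures (lower is
    better) and threshold accuracies (higher is better)."""
    abs_rel = np.mean(np.abs(pred - gt) / gt)                 # mean |d - d*| / d*
    rmse = np.sqrt(np.mean((pred - gt) ** 2))                 # root mean squared error
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))
    # Fraction of pixels whose ratio max(d/d*, d*/d) is under 1.25^k.
    ratio = np.maximum(pred / gt, gt / pred)
    accuracy = {k: float(np.mean(ratio < 1.25 ** k)) for k in (1, 2, 3)}
    return {"abs_rel": abs_rel, "rmse": rmse,
            "rmse_log": rmse_log, "accuracy": accuracy}
```

With a perfect prediction (`pred == gt`) every error term is zero and every threshold accuracy is 1.0, which is a quick sanity check when wiring up an evaluation.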