1 Aug 2017 | Tinghui Zhou*, Matthew Brown, Noah Snavely, David G. Lowe
This paper presents an unsupervised learning framework for monocular depth and camera motion estimation from unstructured video sequences. It takes an end-to-end learning approach with view synthesis as the supervisory signal. Unlike previous work that requires labeled data, the method is completely unsupervised, requiring only monocular video sequences for training. The framework couples a single-view depth network with a multi-view pose network, trained jointly with a loss that warps nearby views onto the target view using the predicted depth and pose. Although coupled during training, the two networks can be applied independently at test time. Empirical evaluation on the KITTI dataset shows that the method performs comparably with supervised methods that use ground-truth pose or depth for training, and that its pose estimation performs favorably against established SLAM systems under comparable input settings.
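To make the supervisory signal concrete, below is a minimal sketch of such a view-synthesis loss: an L1 photometric error between the target frame and source frames warped into the target view, optionally weighted by a per-pixel mask. It assumes PyTorch; the function name and tensor layout are illustrative, not taken from the paper's code.

```python
# Sketch of a view-synthesis (photometric reconstruction) loss.
# Names and shapes are assumptions for illustration only.
import torch

def photometric_loss(target, warped_sources, masks=None):
    """L1 photometric error between the target frame and warped source frames.

    target:         (B, 3, H, W) target view
    warped_sources: list of (B, 3, H, W) source views warped into the target frame
    masks:          optional list of (B, 1, H, W) per-pixel weights (e.g. explainability)
    """
    loss = 0.0
    for i, warped in enumerate(warped_sources):
        diff = (target - warped).abs()      # per-pixel reconstruction error
        if masks is not None:
            diff = diff * masks[i]          # down-weight pixels the mask deems unexplainable
        loss = loss + diff.mean()
    return loss
```

In practice such a loss is typically combined with regularization terms (e.g. depth smoothness), but the photometric term above is the core supervisory signal being described.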
The approach is inspired by the way humans infer ego-motion and scene structure from visual experience. It mimics this by training a model that observes sequences of images and aims to explain its observations by predicting the likely camera motion and scene structure. The model maps directly from input pixels to an estimate of ego-motion (parameterized as 6-DoF transformation matrices) and the underlying scene structure (parameterized as per-pixel depth maps under a reference view). Training is unsupervised, using only sequences of images with no manual labeling or camera motion information.
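As an illustration of the 6-DoF parameterization, the sketch below converts a pose vector of three translations and three Euler angles into a 4x4 transformation matrix. The Z-Y-X Euler convention and the NumPy implementation are assumptions made for clarity, not necessarily the paper's exact convention.

```python
# Hedged sketch: one common way to turn a 6-DoF pose vector into a 4x4 matrix.
import numpy as np

def pose_vec_to_mat(pose):
    """Convert [tx, ty, tz, rx, ry, rz] (translation + Euler angles) to a 4x4 transform."""
    tx, ty, tz, rx, ry, rz = pose
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    T = np.eye(4)
    T[:3, :3] = Rz @ Ry @ Rx   # rotation (Z-Y-X Euler convention, assumed)
    T[:3, 3] = [tx, ty, tz]    # translation
    return T
```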
The framework builds on the insight that a geometric view synthesis system performs consistently well only when its intermediate predictions of scene geometry and camera poses correspond to the physical ground truth. The method uses differentiable depth image-based rendering to reconstruct the target view by sampling pixels from a source view according to the predicted depth map and relative pose. An explainability prediction network is additionally trained to output a per-pixel soft mask indicating where the network believes direct view synthesis can be successfully modeled for each target pixel, down-weighting regions that violate the underlying assumptions (e.g., moving objects and occlusions).
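The warping step can be sketched as follows: each target pixel is back-projected with its predicted depth, transformed by the predicted relative pose, re-projected into the source view with the camera intrinsics (p_s ~ K T_{t->s} D_t(p_t) K^{-1} p_t), and then bilinearly sampled. The PyTorch sketch below is a simplified illustration under these assumptions; the function name and tensor shapes are not the paper's.

```python
# Minimal sketch of differentiable depth image-based rendering (inverse warping).
# Assumes PyTorch; shapes and conventions are illustrative.
import torch
import torch.nn.functional as F

def inverse_warp(src_img, depth, T_target_to_src, K):
    """src_img: (B,3,H,W), depth: (B,H,W), T_target_to_src: (B,4,4), K: (B,3,3)."""
    B, _, H, W = src_img.shape
    # Homogeneous pixel grid of the target view: (3, H*W)
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones], dim=0).float().view(3, -1)
    # Back-project to 3D camera coordinates: D * K^{-1} * p
    cam = torch.inverse(K) @ pix.unsqueeze(0).expand(B, -1, -1)      # (B,3,H*W)
    cam = cam * depth.view(B, 1, -1)
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W)], dim=1)         # (B,4,H*W)
    # Transform into the source frame and project with the intrinsics
    src_cam = (T_target_to_src @ cam_h)[:, :3]                       # (B,3,H*W)
    src_pix = K @ src_cam
    x = src_pix[:, 0] / (src_pix[:, 2] + 1e-7)
    y = src_pix[:, 1] / (src_pix[:, 2] + 1e-7)
    # Normalize coordinates to [-1, 1] and bilinearly sample the source view
    grid = torch.stack([2 * x / (W - 1) - 1, 2 * y / (H - 1) - 1], dim=-1)
    grid = grid.view(B, H, W, 2)
    return F.grid_sample(src_img, grid, align_corners=True)
```

Because bilinear sampling is differentiable, the photometric error on the warped output can propagate gradients back through both the depth and pose predictions.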
The method is evaluated on the KITTI dataset, where it shows comparable performance with supervised methods for both single-view depth estimation and ego-motion estimation. When tested on the Make3D dataset without any training on Make3D images, it still captures the global scene layout reasonably well. In pose estimation it outperforms the baselines but falls short of ORB-SLAM (full), which leverages whole sequences for loop closure and re-localization. Qualitative comparisons likewise show results on par with supervised baselines. Overall, the method proves effective at learning depth and camera motion from unstructured video sequences without any supervision.