1 Aug 2017 | Tinghui Zhou*, Matthew Brown, Noah Snavely, David G. Lowe
This paper presents an unsupervised learning framework for monocular depth and camera motion estimation from unstructured video sequences. Unlike previous methods that require ground-truth pose or depth for training, this approach uses only monocular video. The method employs a single-view depth network and a multi-view pose network, trained with a loss based on warping nearby views to the target view using the predicted depth and pose. The two networks are coupled during training but can operate independently at test time. Empirical evaluation on the KITTI dataset shows that the method performs comparably to supervised methods and outperforms established SLAM systems under similar input settings. The paper also discusses related work, including structure from motion, warping-based view synthesis, and learning single-view 3D from registered 2D views. The approach is robust to various scene dynamics and occlusions, and the learned representations could be useful for other tasks such as object detection and semantic segmentation.
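To make the warping-based supervision concrete, here is a minimal NumPy sketch of the view-synthesis loss: a source view is inverse-warped into the target frame using the predicted depth map and relative camera pose, and the photometric error against the target view becomes the training signal. The function names, the simple bilinear sampler, and the absence of an explainability mask are my own simplifications for illustration, not the authors' implementation.

```python
import numpy as np

def warp_source_to_target(src_img, depth_t, K, T_t2s):
    """Inverse-warp a source view into the target frame.

    Follows the projection p_s ~ K * T_{t->s} * D_t(p_t) * K^{-1} * p_t,
    then bilinearly samples the source image at the projected coordinates.
    src_img : (H, W, 3) source view
    depth_t : (H, W)    predicted depth for the target view
    K       : (3, 3)    camera intrinsics
    T_t2s   : (4, 4)    predicted relative pose (target -> source)
    """
    H, W = depth_t.shape
    # Homogeneous pixel grid of the target view, shape (3, H*W).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u.ravel(), v.ravel(), np.ones(H * W)])

    # Back-project target pixels to 3-D points in the target camera frame.
    cam_t = np.linalg.inv(K) @ pix * depth_t.ravel()

    # Rigidly transform into the source frame and project with the intrinsics.
    cam_t_h = np.vstack([cam_t, np.ones((1, H * W))])
    cam_s = (T_t2s @ cam_t_h)[:3]
    proj = K @ cam_s
    x = proj[0] / np.clip(proj[2], 1e-6, None)
    y = proj[1] / np.clip(proj[2], 1e-6, None)

    # Bilinear sampling of the source image (out-of-bounds coords are clamped).
    x0 = np.clip(np.floor(x).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, H - 2)
    wx = np.clip(x - x0, 0.0, 1.0)[:, None]
    wy = np.clip(y - y0, 0.0, 1.0)[:, None]
    tl = src_img[y0, x0]
    tr = src_img[y0, x0 + 1]
    bl = src_img[y0 + 1, x0]
    br = src_img[y0 + 1, x0 + 1]
    warped = (tl * (1 - wx) * (1 - wy) + tr * wx * (1 - wy)
              + bl * (1 - wx) * wy + br * wx * wy)
    return warped.reshape(H, W, 3)

def view_synthesis_loss(tgt_img, src_img, depth_t, K, T_t2s):
    """Mean photometric (L1) error between the target view and the
    source view warped into the target frame."""
    warped = warp_source_to_target(src_img, depth_t, K, T_t2s)
    return np.abs(tgt_img - warped).mean()
```

In training, depth_t and T_t2s would come from the depth and pose networks and the loss would be backpropagated through a differentiable sampler; since everything here is expressed in terms of the predicted depth and pose, no ground-truth geometry is needed.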