Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer

2020 | René Ranftl*, Katrin Lasinger*, David Hafner, Konrad Schindler, and Vladlen Koltun
This paper presents a method for robust monocular depth estimation that combines multiple datasets during training, even when their annotations are incompatible. The authors propose a training objective that is invariant to changes in depth range and scale, advocate multi-objective learning to combine data from different sources, and highlight the importance of pretraining encoders on auxiliary tasks. They experiment with five diverse training datasets, including a new, massive data source: 3D films. To demonstrate the generalization power of their approach, they evaluate with zero-shot cross-dataset transfer, i.e., on datasets not seen during training. The experiments confirm that mixing data from complementary sources greatly improves monocular depth estimation, and the approach clearly outperforms competing methods across diverse datasets, setting a new state of the art.

The paper first discusses why dense ground-truth depth is hard to acquire at scale across different environments, which has led to datasets with distinct characteristics and biases. To exploit these datasets jointly, the authors propose tools for mixing them during training, chief among them a robust training objective that is invariant to changes in depth range and scale (sketched in code below). They quantify the value of existing datasets for monocular depth estimation, explore strategies for mixing them, and show that a principled approach based on multi-objective optimization yields better results than a naive mixing strategy. They further highlight the importance of high-capacity encoders and the unreasonable effectiveness of pretraining the encoder on a large-scale auxiliary task.

As a new data source, the authors propose 3D movies, which provide a massive and diverse pool of stereo imagery. Using movie data comes with challenges, including the limited depth budget and the unknown baseline and convergence angle between the cameras of the stereo rig; the paper describes data extraction and training procedures that address these issues.

In the experiments, the authors evaluate how useful each training dataset is for generalization and find that mixing multiple training sets consistently improves performance over the single-dataset baseline. However, adding datasets does not unconditionally improve performance when naive mixing is used, whereas Pareto-optimal mixing leads to better results. Comparing their best-performing model to various state-of-the-art approaches, the authors find that it outperforms the baselines by a comfortable margin in zero-shot performance, and they show that this strength is not merely due to increased network capacity but fundamentally due to the proposed training scheme.

The authors also identify common failure cases and biases of the model, stemming from natural biases in the training images as well as typical failure modes of depth estimation. Additional qualitative results on various datasets highlight the model's ability to estimate plausible relative depth even on relatively abstract inputs.

The authors conclude that their work advances the state of the art in generic monocular depth estimation and that the presented ideas substantially improve performance across diverse environments. They hope the work will contribute to the deployment of monocular depth models that meet the requirements of practical applications.
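As a rough illustration of the scale- and shift-invariant objective described above, the following minimal PyTorch sketch aligns each predicted disparity map to its ground truth with a closed-form least-squares scale and shift before measuring the error. It captures only the general idea: the trimming of large residuals and the multi-scale gradient-matching term of the full objective are omitted, and the function name and tensor shapes are assumptions made for this example.

```python
import torch

def scale_shift_invariant_loss(pred, target, mask):
    """Sketch of a scale- and shift-invariant MAE loss in disparity space.

    pred, target: (B, H, W) disparity maps; mask: (B, H, W) bool of valid pixels.
    Each prediction is aligned to its ground truth with a per-image scale s and
    shift t before the error is measured, so the loss is indifferent to the
    unknown global scale and offset of each dataset.
    """
    pred = pred.flatten(1)
    target = target.flatten(1)
    mask = mask.flatten(1).float()

    # Closed-form least squares: minimize sum_i m_i * (s * pred_i + t - target_i)^2
    n = mask.sum(dim=1).clamp(min=1)
    sum_p = (mask * pred).sum(dim=1)
    sum_t = (mask * target).sum(dim=1)
    sum_pp = (mask * pred * pred).sum(dim=1)
    sum_pt = (mask * pred * target).sum(dim=1)
    det = (n * sum_pp - sum_p ** 2).clamp(min=1e-6)
    s = (n * sum_pt - sum_p * sum_t) / det
    t = (sum_pp * sum_t - sum_p * sum_pt) / det

    aligned = s.unsqueeze(1) * pred + t.unsqueeze(1)
    # Mean absolute error over valid pixels (the full objective additionally
    # trims the largest residuals and adds a gradient-matching term).
    return ((mask * (aligned - target).abs()).sum(dim=1) / n).mean()
```

Because the error is computed only after per-image alignment, datasets whose ground truth is defined only up to an unknown scale and shift, such as disparities recovered from 3D films, can be trained with the same objective as metrically calibrated data.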
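The principled mixing strategy treats each training dataset as a separate objective. A common building block for such multi-objective training is the min-norm combination of per-dataset gradients used in multiple-gradient-descent approaches to multi-task learning; the snippet below shows only the two-objective special case as an illustration of the idea, not necessarily the paper's exact procedure, and the toy gradients are placeholders.

```python
import torch

def min_norm_coefficient(g1: torch.Tensor, g2: torch.Tensor) -> torch.Tensor:
    """Return alpha in [0, 1] minimizing ||alpha * g1 + (1 - alpha) * g2||."""
    diff = g1 - g2
    denom = (diff * diff).sum().clamp(min=1e-12)
    alpha = ((g2 - g1) * g2).sum() / denom
    return alpha.clamp(0.0, 1.0)

# Toy usage: g1 and g2 stand in for the flattened gradients of two dataset
# losses with respect to the shared network parameters.
g1, g2 = torch.randn(1000), torch.randn(1000)
alpha = min_norm_coefficient(g1, g2)
update_direction = alpha * g1 + (1 - alpha) * g2  # common descent direction
```

When alpha lies strictly between 0 and 1, stepping against the blended direction decreases both dataset losses at once; when the clamp activates, the update falls back to one dataset's gradient, which is the naive-mixing behaviour.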