27 Mar 2024 | Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, Fisher Yu
UniDepth is a novel model for universal monocular metric depth estimation (MMDE), capable of reconstructing metric 3D scenes from a single image across diverse domains. Unlike existing methods that require training and testing on datasets with similar camera intrinsics and scene scales, UniDepth predicts metric 3D points directly from the input image without any additional information, aiming for a universal and flexible MMDE solution.

The model introduces a self-promptable camera module that predicts a dense camera representation used to condition the depth features, and it adopts a pseudo-spherical output representation that disentangles the camera and depth components of the prediction. In addition, a geometric invariance loss is proposed to encourage the camera-prompted depth features to remain invariant under geometric augmentations.

Evaluations on ten datasets in a zero-shot regime consistently demonstrate UniDepth's superior performance, even against methods trained on the testing domains. The model is robust to camera noise, generalizes well across domains, and sets a new state of the art on multiple benchmarks, surpassing even in-domain trained methods, showcasing its robustness and effectiveness in MMDE.
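To make the camera/depth disentanglement concrete, the sketch below shows one plausible way a pseudo-spherical output (azimuth, elevation, log-depth) could be mapped to metric 3D points. The function name, tensor shapes, and angle convention are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def pseudo_spherical_to_points(azimuth, elevation, log_depth):
    """Hypothetical helper: map a pseudo-spherical prediction to metric 3D points.

    azimuth, elevation, log_depth: (B, H, W) tensors; angles in radians.
    The angular components encode the camera (per-pixel ray directions),
    while log-depth encodes scene scale, so the two stay disentangled
    until this final conversion. Assumption: log-depth is the radial
    distance along each ray; UniDepth's exact parameterization may differ.
    """
    r = torch.exp(log_depth)                      # radial metric depth
    x = r * torch.cos(elevation) * torch.sin(azimuth)
    y = r * torch.sin(elevation)
    z = r * torch.cos(elevation) * torch.cos(azimuth)
    return torch.stack((x, y, z), dim=-1)         # (B, H, W, 3) metric points
```

Under this reading, the angular channels depend only on the predicted camera while the log-depth channel carries scale, which is what allows the camera module to prompt the depth features without the two representations being entangled in a single Euclidean point map.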