UniDepth: Universal Monocular Metric Depth Estimation


27 Mar 2024 | Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, Fisher Yu
UniDepth is a novel approach to monocular metric depth estimation (MMDE) that reconstructs metric 3D scenes from a single image across diverse domains. Unlike existing MMDE methods, UniDepth predicts metric 3D points directly from the input image, without any additional information such as camera intrinsics, making it a universal and flexible solution. Its key contributions are:

1. **Self-Promptable Camera Module**: a camera module outputs a dense camera representation that is used to condition the depth features, letting the model learn prior knowledge about scene scale and camera intrinsics (a conditioning sketch follows below).
2. **Pseudo-Spherical Output Representation**: the output space is defined by azimuth and elevation angles together with depth. This disentangles the camera and depth dimensions, ensuring their orthogonality and improving optimization (a back-projection sketch follows below).
3. **Geometric Invariance Loss**: a loss that promotes invariance of the camera-conditioned depth features across different geometric augmentations of the same scene, enhancing robustness (a loss sketch follows below).

UniDepth is evaluated on ten datasets in a zero-shot regime, where it consistently outperforms state-of-the-art (SOTA) methods, even on domains not seen during training, both in scale-invariant metrics and in overall depth estimation, and it ranks first on the official KITTI Depth Prediction Benchmark. The code and models are available at [github.com/piccinelli-eth/unidepth].
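The summary describes the camera module only at a high level. As an illustration of what "conditioning depth features on a dense camera representation" could look like, here is a minimal PyTorch sketch using cross-attention, where depth-feature tokens query camera-embedding tokens. All class and variable names are hypothetical, and the actual UniDepth architecture may differ in its details.

```python
import torch
import torch.nn as nn

class CameraConditionedBlock(nn.Module):
    """Hypothetical sketch: cross-attention from depth-feature queries to
    camera-embedding keys/values; a residual update injects the camera prior."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, depth_tokens, camera_tokens):
        # depth_tokens: (B, N, C) flattened depth features
        # camera_tokens: (B, M, C) embedded dense camera representation
        q = self.norm(depth_tokens)
        out, _ = self.attn(q, camera_tokens, camera_tokens)
        return depth_tokens + out

# Toy usage: a 32x44 depth-feature map flattened to tokens, 64 camera tokens.
block = CameraConditionedBlock(dim=256)
depth_tokens = torch.randn(2, 32 * 44, 256)
camera_tokens = torch.randn(2, 64, 256)
conditioned = block(depth_tokens, camera_tokens)  # (2, 1408, 256)
```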
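The pseudo-spherical representation can be made concrete with a small back-projection example. The sketch below assumes a pinhole camera, per-pixel angles obtained as arctangents of normalized image coordinates, and depth parameterizing the z-coordinate of each pixel's ray; these conventions are assumptions for illustration, not necessarily the paper's exact formulation.

```python
import torch

def dense_camera_rep(fx, fy, cx, cy, H, W):
    """Per-pixel azimuth/elevation angles from pinhole intrinsics
    (arctangent of normalized image coordinates; convention assumed)."""
    v, u = torch.meshgrid(
        torch.arange(H, dtype=torch.float32),
        torch.arange(W, dtype=torch.float32),
        indexing="ij",
    )
    azimuth = torch.atan((u - cx) / fx)    # horizontal viewing angle
    elevation = torch.atan((v - cy) / fy)  # vertical viewing angle
    return azimuth, elevation

def backproject(azimuth, elevation, log_depth):
    """Map (azimuth, elevation, log-depth) to metric 3D points, assuming
    depth is the z-coordinate along each pixel's ray."""
    z = torch.exp(log_depth)               # log-depth -> metric depth
    x = z * torch.tan(azimuth)
    y = z * torch.tan(elevation)
    return torch.stack((x, y, z), dim=-1)  # (H, W, 3)

# Toy usage on a 480x640 image with a 500 px focal length.
az, el = dense_camera_rep(500.0, 500.0, 320.0, 240.0, 480, 640)
points = backproject(az, el, torch.zeros(480, 640))  # depth = exp(0) = 1 m
```

Because the angles depend only on the camera and the depth channel only on the scene, the two output dimensions stay decoupled, which is the orthogonality the summary refers to.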
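Similarly, the geometric invariance loss can be sketched as a consistency term between features of two geometrically augmented views of the same image, with one branch detached so it acts as the target. This stop-gradient pattern and the function names are assumptions for illustration; whether UniDepth uses exactly this form is not specified in the summary.

```python
import torch
import torch.nn.functional as F

def invariance_loss(feats_ref, feats_aug, undo_aug):
    """Hypothetical sketch: match features of an augmented view, warped back
    onto the reference pixel grid, against stop-gradient reference features."""
    target = feats_ref.detach()      # reference branch acts as the target
    aligned = undo_aug(feats_aug)    # undo the geometric augmentation
    return F.mse_loss(aligned, target)

# Toy usage with horizontal flip as the augmentation: the flipped-view
# features (stand-ins for the model's output on the flipped image) are
# flipped back before being compared with the reference features.
feats_ref = torch.randn(2, 256, 32, 44)
feats_aug = torch.flip(feats_ref + 0.05 * torch.randn_like(feats_ref), dims=[-1])
loss = invariance_loss(feats_ref, feats_aug, lambda f: torch.flip(f, dims=[-1]))
```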