Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data

7 Apr 2024 | Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao
Depth Anything is a practical solution for robust monocular depth estimation. The paper presents a foundation model that can estimate depth from any image under any circumstances, leveraging large-scale unlabeled data to improve generalization and robustness. A data engine collects and automatically annotates large-scale unlabeled images, which significantly expands data coverage and helps reduce generalization error.

Two strategies make this data useful. First, a more challenging optimization target is created with data augmentation tools, compelling the model to actively seek extra visual knowledge and acquire robust representations. Second, an auxiliary supervision enforces the model to inherit rich semantic priors from pre-trained encoders. The model is evaluated on six public datasets and on randomly captured photos, demonstrating impressive generalization ability. Further, fine-tuned with metric depth information from NYUv2 and KITTI, it sets new state-of-the-art results and also yields a better depth-conditioned ControlNet. The model is publicly released.

The paper highlights the value of scaling up massive, cheap, and diverse unlabeled images for monocular depth estimation, and points out a key practice for jointly training on large-scale labeled and unlabeled images: instead of learning from raw unlabeled images directly, the model is challenged with a harder optimization target that forces it to extract extra knowledge. The paper also proposes inheriting rich semantic priors from pre-trained encoders for better scene understanding, rather than using an auxiliary semantic segmentation task. The resulting model exhibits stronger zero-shot capability than MiDaS-BEiT_L-512, and, when fine-tuned with metric depth, it significantly outperforms ZoeDepth.

The use of unlabeled data follows semi-supervised learning, which is popular in many applications; here, unlabeled images significantly enhance data coverage and thus improve model generalization and robustness. The framework exploits both labeled and unlabeled images, denoted D^l and D^u respectively: a teacher trained on D^l produces pseudo depth labels for D^u, and the model is then trained on the combination of labeled and pseudo-labeled images (see the training-step sketch below). Two forms of perturbation are applied to the unlabeled images: strong color distortions and strong spatial distortions. These modifications bring a significant improvement over the labeled-only baseline.

The paper further exploits semantic priors from pre-trained encoders to enhance depth estimation. It proposes transferring the strong semantic capability of DINOv2 to the depth model with an auxiliary feature alignment loss. This not only enhances MDE performance but also yields a multi-task encoder for both middle-level and high-level perception tasks.

Evaluated on six unseen datasets, the model shows strong zero-shot depth estimation. It is also fine-tuned for metric depth estimation and semantic segmentation, achieving state-of-the-art results. The paper concludes that Depth Anything is a promising solution for robust monocular depth estimation and can serve as a generic multi-task encoder for both middle-level and high-level perception tasks.
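The joint training described above can be summarized, under stated assumptions, as pseudo-labeling the unlabeled set with a teacher and supervising the student on both real and pseudo labels. The sketch below is a minimal PyTorch illustration, not the authors' code; `student`, `teacher`, the batch layout, and the loss variant are placeholders.

```python
import torch

def affine_invariant_loss(pred, target, eps=1e-6):
    """Scale-and-shift invariant loss commonly used for relative depth:
    predictions and targets of shape (B, H, W) are normalized by their
    median and mean absolute deviation before comparison."""
    def normalize(d):
        t = d.flatten(1).median(dim=1).values.view(-1, 1, 1)
        s = (d - t).abs().flatten(1).mean(dim=1).view(-1, 1, 1) + eps
        return (d - t) / s
    return (normalize(pred) - normalize(target)).abs().mean()

def train_step(student, teacher, labeled_batch, unlabeled_images, optimizer):
    """One joint step over labeled and pseudo-labeled images (hypothetical names)."""
    x_l, depth_l = labeled_batch                 # labeled images + annotated depth

    # 1. A teacher trained on the labeled set pseudo-labels the unlabeled images.
    with torch.no_grad():
        pseudo_depth = teacher(unlabeled_images)

    # 2. The student is supervised on both real and pseudo labels. In the full
    #    method the unlabeled inputs are additionally strongly perturbed
    #    (see the next sketch); here they are fed as-is for brevity.
    loss_l = affine_invariant_loss(student(x_l), depth_l)
    loss_u = affine_invariant_loss(student(unlabeled_images), pseudo_depth)

    loss = loss_l + loss_u
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```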
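For the strong perturbations on unlabeled images, one plausible instantiation is color jittering plus Gaussian blur for the color distortion and a CutMix-style mix for the spatial distortion; the exact parameters below are illustrative assumptions, not the paper's settings.

```python
import torch
import torchvision.transforms as T

# Strong color distortion: photometric only, so pseudo depth labels are unchanged.
color_distort = T.Compose([
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    T.GaussianBlur(kernel_size=7, sigma=(0.1, 2.0)),
])

def cutmix(images, depths):
    """CutMix-style spatial distortion: paste a random rectangle from a shuffled
    copy of the batch into each image, and mix the pseudo depth maps the same
    way so supervision stays aligned. Shapes: images (B, 3, H, W), depths (B, H, W)."""
    b, _, h, w = images.shape
    perm = torch.randperm(b)
    cut_h, cut_w = h // 2, w // 2                       # fixed box size for simplicity
    top = torch.randint(0, h - cut_h + 1, (1,)).item()
    left = torch.randint(0, w - cut_w + 1, (1,)).item()
    mixed_imgs, mixed_depths = images.clone(), depths.clone()
    mixed_imgs[:, :, top:top + cut_h, left:left + cut_w] = \
        images[perm][:, :, top:top + cut_h, left:left + cut_w]
    mixed_depths[:, top:top + cut_h, left:left + cut_w] = \
        depths[perm][:, top:top + cut_h, left:left + cut_w]
    return mixed_imgs, mixed_depths

def strong_perturb(images, pseudo_depths):
    """Color distortion first (labels unaffected), then spatial mixing (labels mixed too)."""
    return cutmix(color_distort(images), pseudo_depths)
```

Applying the spatial mix to both the image and its pseudo label keeps the supervision geometrically consistent while still presenting the student with a much harder target than the raw unlabeled image.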
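The auxiliary semantic supervision aligns the depth encoder's features with those of a frozen DINOv2 encoder. Below is a minimal sketch of such a feature-alignment loss; the tolerance margin value and the way patches are masked are assumptions for illustration rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def feature_alignment_loss(depth_feats, dino_feats, tolerance=0.85):
    """Auxiliary feature-alignment loss (minimal sketch).
    depth_feats: (B, N, C) patch features from the depth model's encoder.
    dino_feats:  (B, N, C) features from a frozen DINOv2 encoder on the same image.
    Cosine similarity is pushed up, but patches that are already more similar than
    the tolerance margin are ignored, leaving the depth model room to diverge where
    depth and semantics disagree. The margin value here is illustrative."""
    cos = F.cosine_similarity(depth_feats, dino_feats, dim=-1)   # (B, N)
    mask = cos < tolerance                                       # align only below the margin
    if mask.any():
        return (1.0 - cos[mask]).mean()
    return cos.new_zeros(())

# Hypothetical usage alongside the depth losses:
# total_loss = loss_labeled + loss_unlabeled + feature_alignment_loss(f_depth, f_dino)
```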