DepthFM: Fast Generative Monocular Depth Estimation with Flow Matching


2024 | Ming Gui*, Johannes Schusterbauer*, Ulrich Prestel, Pingchuan Ma, Dmytro Kotovenko, Olga Grebenkova, Stefan Andreas Baumann, Tao Hu, Björn Ommer
DepthFM is a fast, efficient generative monocular depth estimation model that addresses two common failure modes: the blurry artifacts of discriminative methods and the slow sampling of prior generative approaches. It frames depth estimation as a direct transport between the image and depth distributions using flow matching, which makes both training and sampling efficient while preserving output quality. By starting from a pre-trained image diffusion model, DepthFM reduces its dependence on large training sets and transfers more readily across objectives; it further benefits from pseudo-labels produced by a discriminative depth model on real-world images. As a generative model, it can also provide reliable per-pixel depth confidence estimates.

DepthFM is trained only on synthetic data, Hypersim for indoor and Virtual KITTI for outdoor scenes, and evaluated zero-shot on established real-world benchmarks: NYUv2, KITTI, ETH3D, ScanNet, and DIODE. It achieves competitive accuracy on these complex natural scenes while sampling far faster than other generative depth models, and it adapts to depth completion with minimal fine-tuning. The approach thus combines the strengths of discriminative and generative depth estimation through dual knowledge transfer: a pre-trained image diffusion model serves as a prior, and a discriminative depth model acts as a teacher, improving both training efficiency and final performance.
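The core flow-matching idea can be sketched in a few lines. The toy code below is an illustrative assumption, not the paper's implementation: `x0` and `x1` stand in for paired image and depth latents (in DepthFM these live in the latent space of a pre-trained diffusion autoencoder, and `v_theta` is a learned U-Net). It shows the two ingredients: regressing a velocity field onto the straight-line interpolant between the two distributions, and few-step Euler integration at sampling time.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for paired latents; in DepthFM these come from a
# pre-trained image diffusion autoencoder (assumption for illustration).
x0 = rng.normal(size=(4, 16))   # source samples ("image" latents)
x1 = rng.normal(size=(4, 16))   # target samples ("depth" latents)

def fm_training_pair(x0, x1, rng):
    """One flow-matching training example: a point on the straight path
    between x0 and x1, plus the constant velocity that a network
    v_theta(x_t, t) would be regressed onto with an MSE loss."""
    t = rng.uniform(size=(x0.shape[0], 1))
    x_t = (1.0 - t) * x0 + t * x1     # linear interpolant
    v_target = x1 - x0                # straight-line velocity target
    return t, x_t, v_target

t, x_t, v_target = fm_training_pair(x0, x1, rng)

def euler_sample(x0, v_fn, n_steps=4):
    """Few-step Euler integration of dx/dt = v(x, t) from t=0 to t=1.
    Near-straight learned paths are why very few steps suffice."""
    x, dt = x0.copy(), 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * v_fn(x, i * dt)
    return x

# With the (here, oracle) constant velocity x1 - x0, even a single
# Euler step lands exactly on the target distribution sample:
depth = euler_sample(x0, lambda x, t: x1 - x0, n_steps=1)
assert np.allclose(depth, x1)
```

The straight-line transport is what distinguishes this from a diffusion model, which would instead denoise from pure Gaussian noise over many steps.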
Because DepthFM is generative, ensembles of depth predictions can be drawn for the same image, which both sharpens the final depth map and yields a confidence estimate. Ablation studies confirm that the image and depth priors each significantly improve zero-shot performance on NYUv2. Overall, DepthFM provides a fast, efficient, and accurate solution for monocular depth estimation.
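The ensemble-based confidence estimate can be illustrated with a minimal sketch. The `predict_depth` function below is a hypothetical stand-in for one generative sample from the model (different initial noise yields slightly different depth maps for the same image); the per-pixel standard deviation across samples then serves as an uncertainty proxy.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_depth(image, noise):
    """Hypothetical stand-in for one generative DepthFM sample:
    a deterministic toy depth map perturbed by sample-dependent noise."""
    return image * 0.5 + 2.0 + 0.05 * noise

image = rng.uniform(size=(8, 8))             # toy input "image"
samples = np.stack([
    predict_depth(image, rng.normal(size=image.shape))
    for _ in range(10)                       # ensemble of 10 samples
])

depth_mean = samples.mean(axis=0)            # aggregated depth estimate
depth_std = samples.std(axis=0)              # per-pixel confidence proxy
# Low std: the samples agree, so the prediction is high-confidence;
# high std flags regions where the model is uncertain.
```

Averaging the ensemble improves the point estimate, while the spread is exactly the confidence signal a purely discriminative regressor cannot provide.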