2024 | Ming Gui*, Johannes Schusterbauer*, Ulrich Prestel, Pingchuan Ma, Dmytro Kotovenko, Olga Grebenkova, Stefan Andreas Baumann, Tao Hu, Björn Ommer
DepthFM is a novel generative model for monocular depth estimation that addresses the blurry artifacts common to discriminative methods and the slow sampling of generative approaches. By framing depth estimation as a direct transport between the image and depth distributions, DepthFM leverages flow matching to improve both sampling efficiency and performance. Rather than starting from Gaussian noise and iteratively denoising it into a depth map, the method learns the transport trajectory directly from image space to depth space, making it faster and more efficient than current diffusion-based solutions.
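The core flow matching idea behind this direct transport can be sketched as follows. This is a minimal illustrative example under simplifying assumptions (a straight-line interpolant between paired image and depth latents, and simple Euler integration at sampling time); the function names `flow_matching_pair` and `euler_sample` are hypothetical and not from the paper's code.

```python
import numpy as np

def flow_matching_pair(x0, x1, t):
    """Build one flow matching training example.

    x0: source sample (e.g. image latent), x1: target sample (e.g. depth
    latent), t: scalar time in [0, 1]. Returns the point on the straight
    interpolation path and the regression target for the velocity network.
    """
    # Linear interpolant between the two distributions
    xt = (1.0 - t) * x0 + t * x1
    # Velocity of the straight-line transport path (constant in t)
    v_target = x1 - x0
    return xt, v_target

def euler_sample(v_fn, x0, n_steps=4):
    """Integrate the learned velocity field v_fn(x, t) from t=0 to t=1.

    Few steps suffice when the learned paths are nearly straight, which is
    the source of the sampling-efficiency claim.
    """
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * v_fn(x, t)
    return x
```

In training, a network would regress `v_target` from `(xt, t)`; at inference, `euler_sample` transports an image latent to a depth latent in only a handful of steps.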
To improve training and data efficiency, DepthFM incorporates external knowledge from pre-trained image diffusion models and discriminative depth estimation models. The image diffusion model provides a robust image prior, while the discriminative model offers a depth prior, enhancing the generative model's performance. DepthFM also employs synthetic data and image-depth pairs generated by a discriminative model on real-world images to further boost its effectiveness.
Experiments demonstrate that DepthFM achieves competitive zero-shot performance on standard benchmarks of complex natural scenes, improving sampling efficiency and requiring minimal synthetic data for training. The model can estimate depth confidence, providing an additional advantage. DepthFM outperforms other generative and discriminative methods in terms of sampling speed, depth fidelity, and edge precision and recall. It also shows strong generalization to real-world data with varying resolutions and aspect ratios, making it a versatile and efficient solution for monocular depth estimation.
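One standard way a generative depth estimator can expose a confidence signal is to draw several depth hypotheses for the same image and measure their per-pixel spread. The sketch below illustrates that general pattern only; `sample_fn` is a hypothetical sampling callable, and this is not the paper's exact procedure.

```python
import numpy as np

def depth_with_confidence(sample_fn, image, n_samples=8):
    """Aggregate several stochastic depth samples into a prediction
    and a per-pixel uncertainty map.

    sample_fn(image) is assumed to return one depth map (a 2D array)
    drawn from the generative model.
    """
    # Stack n_samples independent depth hypotheses: (n_samples, H, W)
    preds = np.stack([sample_fn(image) for _ in range(n_samples)])
    # Mean over samples is the point estimate
    mean_depth = preds.mean(axis=0)
    # Per-pixel standard deviation serves as an uncertainty proxy:
    # low spread = high confidence
    uncertainty = preds.std(axis=0)
    return mean_depth, uncertainty
```

Pixels where the samples disagree (high standard deviation) are flagged as low-confidence, which is useful for downstream tasks that must filter unreliable depth.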