WorDepth: Variational Language Prior for Monocular Depth Estimation

2 Jun 2024 | Ziyao Zeng, Daniel Wang, Fengyu Yang, Hyoungseob Park, Yangchao Wu, Stefano Soatto, Byung-Woo Hong, Dong Lao, Alex Wong
WorDepth is a variational language-prior approach to monocular depth estimation that combines text descriptions with image data to produce metric-scaled depth maps. A variational autoencoder (text-VAE) encodes a text caption into the mean and standard deviation of a distribution over plausible depth maps, and a conditional sampler selects a specific depth map from that distribution based on the image. Training alternates between the two components: the text-VAE learns the distribution of depth maps consistent with a given description, while the conditional sampler resolves which of those maps matches the observed scene.

Trained on indoor (NYUv2) and outdoor (KITTI) datasets, WorDepth achieves state-of-the-art performance, with significant improvements over baseline methods across standard evaluation metrics. By incorporating semantic information from text as a prior, the method reduces the inherent scale ambiguity of single-image depth estimation, yielding more accurate, metric-scaled predictions. It also shows strong zero-shot generalization across datasets, and its efficiency makes it practical for real-world scenarios.
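To make the pipeline concrete, here is a minimal PyTorch sketch of the text-VAE, the conditional sampler, and the alternating training step. All module names, layer sizes, and loss weights here are illustrative assumptions based on the description above, not the authors' released code; it assumes pre-extracted text and image features (e.g., from a frozen CLIP-style encoder).

```python
# Minimal sketch of the WorDepth idea. Shapes, names, and weights are
# illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT, FEAT = 128, 512  # assumed latent and feature dimensions


class TextVAE(nn.Module):
    """Encodes a caption embedding into (mu, sigma) over depth-map latents."""
    def __init__(self):
        super().__init__()
        self.mu = nn.Linear(FEAT, LATENT)
        self.log_var = nn.Linear(FEAT, LATENT)

    def forward(self, t):
        mu, log_var = self.mu(t), self.log_var(t)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # reparameterize
        return z, mu, log_var


class ConditionalSampler(nn.Module):
    """Predicts one latent from image features, i.e., picks a single
    plausible depth map out of the text-conditioned distribution."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(FEAT, LATENT)

    def forward(self, v):
        return self.net(v)


class DepthDecoder(nn.Module):
    """Decodes a latent (with image features) into a dense depth map."""
    def __init__(self, h=48, w=64):
        super().__init__()
        self.h, self.w = h, w
        self.net = nn.Linear(LATENT + FEAT, h * w)

    def forward(self, z, v):
        d = self.net(torch.cat([z, v], dim=-1))
        return F.softplus(d).view(-1, 1, self.h, self.w)  # positive depths


def train_step(step, text_emb, img_feat, depth_gt, vae, sampler, decoder):
    """Alternate objectives: even steps fit the text-VAE (reconstruction + KL);
    odd steps fit the image-conditioned sampler against the frozen text branch."""
    if step % 2 == 0:
        z, mu, log_var = vae(text_emb)
        recon = F.l1_loss(decoder(z, img_feat), depth_gt)
        kl = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())
        return recon + 1e-3 * kl  # KL weight is an assumed hyperparameter
    with torch.no_grad():
        _, mu, _ = vae(text_emb)  # text branch frozen on sampler steps
    z_img = sampler(img_feat)
    depth = decoder(z_img, img_feat)
    return F.l1_loss(depth, depth_gt) + F.mse_loss(z_img, mu)
```

In this sketch, even steps learn the caption-conditioned distribution of depth maps, while odd steps train the sampler to choose a latent consistent with the frozen text posterior, mirroring the alternating optimization described above.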