2 Jun 2024 | Ziyao Zeng, Daniel Wang, Fengyu Yang, Hyoungseob Park, Yangchao Wu, Stefano Soatto, Byung-Woo Hong, Dong Lao, Alex Wong
WorDepth: Variational Language Prior for Monocular Depth Estimation
**Abstract:**
This paper addresses the ill-posed problem of monocular depth estimation: predicting a 3D scene from a single image is ambiguous, most notably in scale. We explore whether two inherently ambiguous modalities, text descriptions and images, can be used jointly to produce metric-scaled reconstructions. Our approach uses a variational auto-encoder (VAE) to encode a text caption into the mean and standard deviation of a latent distribution over plausible metric-scale reconstructions of the described scene. An image-based conditional sampler then selects a specific depth map by sampling from the VAE's latent space, conditioned on the given image. Training alternates between optimizing the VAE and the conditional sampler, improving depth estimation accuracy. We demonstrate the approach on indoor (NYUv2) and outdoor (KITTI) datasets, showing consistent performance improvements.
**Contributions:**
1. We propose WorDepth, a variational framework that leverages the complementary strengths of text and images for monocular depth estimation.
2. We introduce an image-based conditional sampler that models language as a conditional prior.
3. We achieve state-of-the-art performance on both indoor and outdoor benchmarks.
4. We are the first to treat language as a variational prior for monocular depth estimation.
**Related Work:**
- Monocular depth estimation methods often rely on generic priors or specific network designs.
- Variational and generative methods have explored uncertainty in depth estimation.
- Foundation models and vision-language models have been used for monocular depth estimation, but WorDepth explicitly models language as a prior.
**Method:**
- **Text Variational Auto-Encoder (VAE):** Encodes text captions using a pre-trained vision-language model (CLIP) and learns the mean and standard deviation of the latent distribution of plausible scenes.
- **Image-Based Conditional Sampler:** Predicts a noise vector from an image to sample from the latent distribution, selecting the most probable depth map.
- **Training Loss:** Minimizes a combination of a scale-invariant depth loss and a KL-divergence regularizer (a minimal sketch of all three components follows this list).
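To make the pipeline concrete, below is a minimal PyTorch-style sketch of the three components above. All module names, dimensions, loss weights, and the alternating schedule are illustrative assumptions, not the authors' released code; `decoder` stands in for a hypothetical depth decoder.

```python
import torch
import torch.nn as nn

class TextVAEEncoder(nn.Module):
    """Maps a CLIP text embedding to the mean and log-variance of a latent
    distribution over plausible metric-scale scenes (hypothetical sketch)."""
    def __init__(self, clip_dim=512, latent_dim=128):
        super().__init__()
        self.mu_head = nn.Linear(clip_dim, latent_dim)
        self.logvar_head = nn.Linear(clip_dim, latent_dim)

    def forward(self, text_emb):
        return self.mu_head(text_emb), self.logvar_head(text_emb)

class ConditionalSampler(nn.Module):
    """Predicts a noise vector z* from image features so that mu + sigma * z*
    selects one depth map from the text-conditioned distribution."""
    def __init__(self, img_dim=1024, latent_dim=128):
        super().__init__()
        self.head = nn.Linear(img_dim, latent_dim)

    def forward(self, img_feat):
        return self.head(img_feat)

def silog_loss(pred, target, lam=0.85, eps=1e-6):
    """Scale-invariant log loss (Eigen et al.), the depth term of the objective."""
    d = torch.log(pred + eps) - torch.log(target + eps)
    return torch.sqrt((d ** 2).mean() - lam * d.mean() ** 2)

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, sigma^2) || N(0, I) ), the regularizer on the text latent."""
    return 0.5 * (mu ** 2 + logvar.exp() - 1.0 - logvar).sum(dim=-1).mean()

# One alternating training step (assumed schedule): on even steps, optimize the
# text VAE with the reparameterization trick; on odd steps, freeze the text
# branch and train the image-conditional sampler. The 0.1 KL weight is a guess.
def training_step(step, text_emb, img_feat, gt_depth, encoder, sampler, decoder):
    mu, logvar = encoder(text_emb)
    sigma = (0.5 * logvar).exp()
    if step % 2 == 0:
        z = mu + sigma * torch.randn_like(sigma)  # sample the language prior
        loss = silog_loss(decoder(z), gt_depth) \
             + 0.1 * kl_to_standard_normal(mu, logvar)
    else:
        # image-predicted noise picks one sample; text branch is frozen
        z = mu.detach() + sigma.detach() * sampler(img_feat)
        loss = silog_loss(decoder(z), gt_depth)
    return loss
```

The alternating schedule mirrors the training procedure described in the abstract: the VAE phase keeps the latent space a well-formed distribution, while the sampler phase teaches the image branch to pick the correct point within it.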
**Experiments:**
- **Datasets:** NYU Depth V2 and KITTI.
- **Network Architecture:** Uses ResNet-50 and Swin-L Transformer for text and image encoders, respectively.
- **Hyperparameters:** Adam optimizer with a cosine learning rate scheduler (a small setup sketch follows this list).
- **Evaluation:** Quantitative results show significant improvements over baselines, especially in threshold accuracy.
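For illustration, here is how the stated optimizer and schedule pair up in PyTorch. The model, initial learning rate, batch contents, and step count are placeholder assumptions; the summary only specifies Adam with cosine decay.

```python
import torch

# Stand-in network; the paper pairs a Swin-L image encoder with a depth decoder.
model = torch.nn.Linear(16, 1)

# Adam with cosine learning-rate decay, as stated above. The initial LR and
# total step count are illustrative assumptions, not values from the paper.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50_000)

for step in range(50_000):
    x = torch.randn(8, 16)
    loss = model(x).pow(2).mean()  # placeholder objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()  # anneals the LR along a cosine curve
```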
**Discussion:**
- Language priors can calibrate the learned scene distribution to true real-world statistics, addressing scale ambiguity in depth estimation.
- WorDepth opens new avenues for 3D reconstruction and extends existing works to metric-scale depth predictions.