ECoDepth is a novel approach to monocular depth estimation that conditions a diffusion model on embeddings from a pre-trained Vision Transformer (ViT). The method improves single-image depth estimation (SIDE) by exploiting the semantic context carried in ViT embeddings, which is more informative than the pseudo-image captions used by prior diffusion-based methods. Built on a diffusion backbone, the model achieves state-of-the-art results on the NYU Depth v2 and KITTI benchmarks: an absolute relative error of 0.059 on NYU Depth v2, a 14% improvement over the prior SOTA (VPD) at 0.069, and a squared relative error of 0.139 on KITTI, a 2% improvement over the prior SOTA (GEDepth) at 0.142. The model also demonstrates strong zero-shot transfer, outperforming existing methods on multiple unseen datasets.

The key contribution of the paper is the use of ViT embeddings to supply semantic context to the diffusion model, which yields more accurate depth estimates. The model is implemented in PyTorch and trained separately on NYU Depth v2 (indoor) and KITTI (outdoor), performing well on both indoor and outdoor depth estimation. Overall, the results show that the proposed method significantly improves depth estimation accuracy and generalizes well across datasets.
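To make the conditioning idea concrete, below is a minimal, self-contained PyTorch sketch, not the authors' implementation: a global ViT embedding is projected into a short sequence of context tokens, which a cross-attention layer (standing in for the conditioning pathway inside a diffusion U-Net block) attends to in place of text-caption embeddings. The module names (ViTEmbeddingConditioner, CrossAttentionBlock), token counts, and feature shapes are illustrative assumptions, and the ViT embedding is assumed to be precomputed by a frozen encoder.

```python
# Hypothetical sketch of ViT-embedding conditioning for a diffusion backbone.
# This is NOT the ECoDepth code; names and shapes are illustrative.

import torch
import torch.nn as nn


class ViTEmbeddingConditioner(nn.Module):
    """Projects a global ViT embedding into a sequence of context tokens
    that a diffusion U-Net can attend to via cross-attention."""

    def __init__(self, vit_dim=768, ctx_dim=768, num_tokens=8):
        super().__init__()
        self.num_tokens = num_tokens
        self.ctx_dim = ctx_dim
        # Single projection producing all context tokens at once.
        self.proj = nn.Sequential(
            nn.Linear(vit_dim, num_tokens * ctx_dim),
            nn.GELU(),
        )

    def forward(self, vit_embedding):  # (B, vit_dim)
        ctx = self.proj(vit_embedding)  # (B, num_tokens * ctx_dim)
        return ctx.view(-1, self.num_tokens, self.ctx_dim)  # (B, T, C)


class CrossAttentionBlock(nn.Module):
    """One cross-attention layer, standing in for the conditioning
    pathway inside a diffusion U-Net block."""

    def __init__(self, feat_dim=320, ctx_dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=feat_dim, num_heads=heads,
            kdim=ctx_dim, vdim=ctx_dim, batch_first=True,
        )
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, feats, ctx):  # feats: (B, HW, feat_dim)
        # U-Net features query the ViT-derived context tokens.
        attended, _ = self.attn(self.norm(feats), ctx, ctx)
        return feats + attended  # residual update


if __name__ == "__main__":
    B = 2
    vit_embedding = torch.randn(B, 768)   # stand-in for frozen ViT output
    feats = torch.randn(B, 32 * 32, 320)  # stand-in for U-Net feature map

    ctx = ViTEmbeddingConditioner()(vit_embedding)  # (B, 8, 768)
    out = CrossAttentionBlock()(feats, ctx)
    print(out.shape)  # torch.Size([2, 1024, 320])
```

The design point this sketch illustrates is that the cross-attention interface of the diffusion backbone is unchanged; only the source of the context tokens differs, swapping caption embeddings for semantically richer ViT-derived ones.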