Vision Transformers for Dense Prediction

24 Mar 2021 | René Ranftl, Alexey Bochkovskiy, Vladlen Koltun
This paper introduces the Dense Prediction Transformer (DPT), an architecture that replaces convolutional backbones with vision transformers for dense prediction tasks. DPT uses a vision transformer as the encoder, which processes image representations (tokens) at a constant resolution and maintains a global receptive field at every stage. Tokens from several stages of the transformer are reassembled into image-like feature maps and combined by a convolutional decoder to produce full-resolution dense predictions. These properties allow DPT to deliver predictions that are both fine-grained and globally coherent.

Evaluated on multiple tasks and datasets, DPT outperforms fully-convolutional networks in both large-scale and small-scale settings. For monocular depth estimation, it achieves an improvement of up to 28% in relative performance over a state-of-the-art fully-convolutional network. For semantic segmentation, it sets a new state of the art on ADE20K with 49.02% mIoU, and it also performs strongly when fine-tuned on smaller datasets such as NYUv2, KITTI, and Pascal Context. The paper concludes that DPT provides significant improvements over fully-convolutional networks on dense prediction tasks. Code and models are available at https://github.com/intel-isl/DPT.
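To make the encoder/decoder split concrete, below is a minimal PyTorch sketch of a DPT-style pipeline. It is not the authors' implementation: all names (MiniDPT, patch_size, dim, and so on) are hypothetical, and the decoder is deliberately simplified to a single upsampling head rather than the multi-stage token reassembly and fusion described in the paper.

```python
import torch
import torch.nn as nn

class MiniDPT(nn.Module):
    """Minimal sketch of a DPT-style model (illustrative, not the paper's code):
    a ViT-style encoder keeps tokens at a constant resolution with a global
    receptive field, and a small convolutional decoder turns the token grid
    back into a full-resolution dense prediction."""

    def __init__(self, image_size=224, patch_size=16, dim=192, depth=4, out_channels=1):
        super().__init__()
        self.grid = image_size // patch_size
        # Patch embedding: split the image into non-overlapping patches (tokens).
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.grid * self.grid, dim))
        # Transformer encoder: every layer attends over all tokens (global receptive field).
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Convolutional decoder: reassemble tokens into an image-like feature map
        # and upsample back to the input resolution.
        self.decoder = nn.Sequential(
            nn.Conv2d(dim, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=patch_size, mode="bilinear", align_corners=False),
            nn.Conv2d(64, out_channels, kernel_size=1),
        )

    def forward(self, x):
        b = x.shape[0]
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        tokens = self.encoder(tokens + self.pos_embed)           # constant token resolution
        feat = tokens.transpose(1, 2).reshape(b, -1, self.grid, self.grid)
        return self.decoder(feat)                                # (B, out_channels, H, W)

# Usage: one forward pass producing a per-pixel map (e.g. depth).
model = MiniDPT()
depth_map = model(torch.randn(1, 3, 224, 224))
print(depth_map.shape)  # torch.Size([1, 1, 224, 224])
```

The structural point the sketch illustrates is that, unlike a convolutional backbone that progressively downsamples, the token grid stays at the same resolution through every transformer layer; only the convolutional decoder restores full image resolution.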