24 Mar 2021 | René Ranftl, Alexey Bochkovskiy, Vladlen Koltun
The paper introduces the dense prediction transformer (DPT), an architecture that leverages vision transformers as a backbone for dense prediction tasks. DPT assembles tokens from various stages of the vision transformer into image-like representations at several resolutions and progressively combines them into a full-resolution prediction with a convolutional decoder. Because the transformer backbone processes its representation at a constant and relatively high resolution, with a global receptive field at every stage, DPT produces finer-grained and more globally coherent predictions than fully-convolutional networks. Experiments on monocular depth estimation and semantic segmentation show substantial improvements, especially when large training datasets are available: DPT achieves up to a 28% relative performance improvement over the state-of-the-art fully-convolutional network in monocular depth estimation and sets new state-of-the-art results on ADE20K and Pascal Context for semantic segmentation. The architecture is also competitive on smaller datasets such as NYUv2 and KITTI. The paper details the design of the transformer encoder and convolutional decoder and provides extensive experimental results and ablation studies.
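The "reassemble" step described above can be sketched in a few lines: drop the readout (class) token, fold the remaining patch tokens back into a 2-D feature grid, and resample that grid to the resolution wanted at a given decoder stage. This is a minimal NumPy sketch under assumed shapes (384×384 input, 16×16 patches), not the paper's exact implementation, which uses learned (de)convolutions for the resampling:

```python
import numpy as np

def reassemble(tokens, image_size=384, patch_size=16, scale=1):
    """Hypothetical sketch of DPT's reassemble operation.

    tokens: (1 + N, D) array -- one readout token followed by N patch
    tokens, as produced by a ViT encoder stage.
    Returns a (D, H, W) image-like feature map.
    """
    grid = image_size // patch_size          # e.g. 384 // 16 = 24
    patch_tokens = tokens[1:]                # ignore the readout token
    assert patch_tokens.shape[0] == grid * grid
    dim = patch_tokens.shape[1]
    # fold the token sequence back into a spatial grid: (D, grid, grid)
    fmap = patch_tokens.reshape(grid, grid, dim).transpose(2, 0, 1)
    # nearest-neighbour upsampling stands in for the paper's learned
    # strided transposed convolutions that set per-stage resolution
    if scale > 1:
        fmap = fmap.repeat(scale, axis=1).repeat(scale, axis=2)
    return fmap
```

For a ViT-Base-style encoder this turns the 577×768 token sequence into a 768×24×24 map (or 768×48×48 with `scale=2`), which a convolutional decoder can then fuse across stages like an ordinary feature pyramid.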