Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

11 Aug 2021 | Wenhai Wang1, Enze Xie2, Xiang Li3, Deng-Ping Fan4*, Kaitao Song5, Ding Liang5, Tong Lu1*, Ping Luo2, Ling Shao4
The Pyramid Vision Transformer (PVT) is a novel backbone network designed for dense prediction tasks such as object detection and semantic segmentation. Unlike traditional Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), PVT combines the advantages of both, offering a versatile and efficient solution for various computer vision tasks. PVT incorporates a pyramid structure to generate multi-scale feature maps, which are crucial for dense prediction. It also introduces a progressive shrinking pyramid to reduce computational cost and a spatial-reduction attention (SRA) layer to handle high-resolution features more efficiently. PVT has been shown to outperform state-of-the-art CNN backbones such as ResNet and ResNeXt on multiple downstream tasks, including object detection and semantic segmentation.

Additionally, PVT can be easily integrated with task-specific Transformer decoders, such as DETR, to build end-to-end object detection systems without convolutions. The paper provides extensive experimental results and ablation studies to demonstrate the effectiveness and flexibility of PVT.
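The core idea of the SRA layer mentioned above is to shrink the spatial resolution of the key/value tokens by a reduction ratio R before attention, cutting the attention cost by roughly R² while the queries keep full resolution. A minimal single-head NumPy sketch follows; the paper performs the reduction with a learned strided projection, so the average pooling and omitted Q/K/V projections here are simplifying assumptions for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_reduction_attention(x, h, w, r, d):
    """Sketch of spatial-reduction attention (SRA).

    x : (n, d) token features laid out row-major over an h*w grid (n = h*w)
    r : spatial reduction ratio; keys/values come from an (h/r)*(w/r) grid,
        so the attention matrix shrinks from n*n to n*(n/r^2).
    Note: PVT uses a learned strided projection for the reduction;
    average pooling here is a simplifying stand-in.
    """
    n = h * w
    assert x.shape == (n, d) and h % r == 0 and w % r == 0
    # Average-pool each r x r spatial block to downsample keys/values.
    grid = x.reshape(h // r, r, w // r, r, d)
    reduced = grid.mean(axis=(1, 3)).reshape(-1, d)     # (n / r^2, d)
    # Plain scaled dot-product attention (learned projections omitted).
    q, k, v = x, reduced, reduced
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)       # (n, n / r^2)
    return attn @ v                                     # (n, d)

out = spatial_reduction_attention(np.random.randn(64, 32), 8, 8, 2, 32)
print(out.shape)  # (64, 32): full-resolution output, reduced attention cost
```

With r = 2 the attention matrix has 64 x 16 entries instead of 64 x 64; PVT applies larger ratios at the high-resolution early stages, where full self-attention would be prohibitively expensive.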