Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

11 Aug 2021 | Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao
Pyramid Vision Transformer (PVT) is a convolution-free backbone for dense prediction that combines the strengths of CNNs and Transformers. Unlike the Vision Transformer (ViT), which produces a single low-resolution feature map and is designed for image classification, PVT adopts the pyramid structure of CNNs to produce high-resolution, multi-scale feature maps, making it suitable for tasks such as object detection, instance segmentation, and semantic segmentation. Two design choices keep its computation tractable: a progressive shrinking pyramid that reduces the token sequence length stage by stage, and a spatial-reduction attention (SRA) layer that shrinks the key and value maps before attention is computed. Experiments show that PVT outperforms CNN backbones such as ResNet and ResNeXt on object detection and semantic segmentation; for example, PVT-Small with RetinaNet achieves 40.4 AP on COCO val2017, surpassing ResNet50 by 4.1 AP. PVT can also be combined with DETR to build a fully end-to-end, convolution-free object detection system. Because of these properties, the same backbone serves a range of downstream tasks, including image classification, object detection, and semantic segmentation, with lower computational cost than ViT when handling high-resolution feature maps. PVT's performance is validated across multiple benchmarks, demonstrating its effectiveness as a versatile backbone for dense prediction.
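To make the SRA idea concrete, below is a minimal PyTorch sketch, not the authors' implementation: keys and values are computed on a spatially reduced copy of the token map (by a factor R per side), so attention cost drops by roughly R² while queries keep full resolution. The class name SpatialReductionAttention and the patch-merging reduction (a linear layer over R×R token groups) are illustrative assumptions; the official code may realize the reduction differently, but the operation sketched here matches the description in the paper.

```python
import torch
import torch.nn as nn

class SpatialReductionAttention(nn.Module):
    """Sketch of SRA: multi-head self-attention whose keys/values come from
    a spatially reduced token map, cutting attention cost by ~sr_ratio**2."""

    def __init__(self, dim, num_heads=8, sr_ratio=2):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.sr_ratio = sr_ratio

        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)
        if sr_ratio > 1:
            # Illustrative reduction: merge each R x R token neighborhood
            # into one token via a linear projection (no convolutions).
            self.sr = nn.Linear(dim * sr_ratio * sr_ratio, dim)
            self.norm = nn.LayerNorm(dim)

    def forward(self, x, H, W):
        B, N, C = x.shape  # N == H * W tokens
        # Queries stay at full resolution: (B, heads, N, head_dim).
        q = self.q(x).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        if self.sr_ratio > 1:
            R = self.sr_ratio
            # Group tokens into R x R windows: (B, N, C) -> (B, N/R^2, R*R*C).
            x_ = x.reshape(B, H // R, R, W // R, R, C)
            x_ = x_.permute(0, 1, 3, 2, 4, 5).reshape(B, (H // R) * (W // R), R * R * C)
            x_ = self.norm(self.sr(x_))
        else:
            x_ = x

        # Keys/values on the reduced map: (B, heads, N/R^2, head_dim) each.
        kv = self.kv(x_).reshape(B, -1, 2, self.num_heads, self.head_dim)
        k, v = kv.permute(2, 0, 3, 1, 4).unbind(0)

        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, heads, N, N/R^2)
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

# Usage sketch with stage-1-like settings for a 224x224 input
# (56x56 tokens, 64 channels, heavy reduction at the finest stage):
x = torch.randn(2, 56 * 56, 64)
sra = SpatialReductionAttention(dim=64, num_heads=1, sr_ratio=8)
y = sra(x, H=56, W=56)  # output shape matches input: (2, 3136, 64)
```

The point of the sketch is the asymmetry: the query set keeps all H*W positions so output resolution is preserved, while the key/value set shrinks, which is what lets PVT afford attention over the high-resolution early stages of the pyramid.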