PVT v2: Improved Baselines with Pyramid Vision Transformer


17 Apr 2023 | Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao
This paper introduces PVT v2, an improved version of the Pyramid Vision Transformer (PVT v1), which enhances the original model with three key design changes: (1) a linear-complexity attention layer, (2) overlapping patch embedding, and (3) a convolutional feed-forward network. These modifications reduce the computational complexity of attention to linear and significantly improve performance on classification, detection, and segmentation, making PVT v2 comparable to or better than recent models such as Swin Transformer. PVT v2 is evaluated on image classification, object detection, and semantic segmentation. On ImageNet, PVT v2-B5 achieves 83.8% top-1 accuracy, outperforming Swin-B and Twins-SVT-L. In object detection, PVT v2-B4 achieves 46.1 AP with RetinaNet and 47.5 AP with Mask R-CNN, surpassing PVT v1. In semantic segmentation, PVT v2 outperforms PVT v1 and other models on the ADE20K dataset, while incurring lower computational costs than comparable models. The ablation study shows that each of the three design improvements contributes to better performance, with the linear spatial-reduction attention layer significantly reducing computational overhead. Overall, PVT v2 provides a more efficient and effective baseline for vision transformers.
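To make the three design changes concrete, the following is a minimal PyTorch sketch of each component as described in the paper: an overlapping patch embedding (a strided convolution whose kernel overlaps neighbouring patches), linear spatial-reduction attention (keys and values computed from a fixed 7x7 average-pooled feature map), and a convolutional feed-forward network with a 3x3 depth-wise convolution. The class names, tensor layouts, and hyperparameters here are illustrative assumptions, not the authors' official implementation.

```python
# Hypothetical sketch of the three PVT v2 components; names and details
# follow the paper's description, not the official code.
import torch
import torch.nn as nn


class OverlappingPatchEmbed(nn.Module):
    """Overlapping patch embedding: a strided conv whose kernel is larger
    than its stride, so neighbouring patches share pixels."""
    def __init__(self, in_chans=3, embed_dim=64, patch_size=7, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size,
                              stride=stride, padding=patch_size // 2)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.proj(x)                       # (B, D, H', W')
        B, D, H, W = x.shape
        x = x.flatten(2).transpose(1, 2)       # (B, H'*W', D) token sequence
        return self.norm(x), H, W


class LinearSRAttention(nn.Module):
    """Linear spatial-reduction attention: keys/values come from a fixed-size
    (7x7) average-pooled map, so cost is linear in the number of queries."""
    def __init__(self, dim, num_heads=1, pool_size=7):
        super().__init__()
        self.num_heads = num_heads
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)
        self.pool = nn.AdaptiveAvgPool2d(pool_size)
        self.sr = nn.Conv2d(dim, dim, kernel_size=1)
        self.norm = nn.LayerNorm(dim)
        self.act = nn.GELU()

    def forward(self, x, H, W):                # x: (B, N, D), N = H*W
        B, N, D = x.shape
        hd = D // self.num_heads
        q = self.q(x).reshape(B, N, self.num_heads, hd).transpose(1, 2)

        # Reduce keys/values to a constant 7x7 spatial grid.
        feat = x.transpose(1, 2).reshape(B, D, H, W)
        feat = self.sr(self.pool(feat)).flatten(2).transpose(1, 2)  # (B, 49, D)
        feat = self.act(self.norm(feat))
        kv = self.kv(feat).reshape(B, -1, 2, self.num_heads, hd).permute(2, 0, 3, 1, 4)
        k, v = kv[0], kv[1]

        attn = (q @ k.transpose(-2, -1)) * (hd ** -0.5)
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, D)
        return self.proj(out)


class ConvFFN(nn.Module):
    """Convolutional feed-forward network: a 3x3 depth-wise conv between the
    two linear layers injects local positional information."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim, 3, padding=1, groups=hidden_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x, H, W):                # x: (B, N, D)
        x = self.fc1(x)
        B, N, C = x.shape
        x = x.transpose(1, 2).reshape(B, C, H, W)
        x = self.dwconv(x).flatten(2).transpose(1, 2)
        return self.fc2(self.act(x))


if __name__ == "__main__":
    img = torch.randn(1, 3, 224, 224)
    tokens, H, W = OverlappingPatchEmbed()(img)   # 56*56 = 3136 tokens of dim 64
    tokens = tokens + LinearSRAttention(64)(tokens, H, W)
    tokens = tokens + ConvFFN(64, 256)(tokens, H, W)
    print(tokens.shape)                           # torch.Size([1, 3136, 64])
```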