PVT v2: Improved Baselines with Pyramid Vision Transformer

17 Apr 2023 | Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao
The paper "PVT v2: Improved Baselines with Pyramid Vision Transformer" by Wenhai Wang et al. introduces significant improvements to the Pyramid Vision Transformer (PVT v1) by adding three key designs: linear complexity attention layer, overlapping patch embedding, and convolutional feed-forward network. These modifications reduce the computational complexity of PVT v1 to linear, achieving substantial improvements in fundamental vision tasks such as classification, detection, and segmentation. Notably, PVT v2 outperforms recent works like Swin Transformer while having fewer parameters and GFLOPs. The authors report that PVT v2-B5 achieves 83.8% top-1 accuracy on ImageNet, surpassing Swin-B and Twins-SVT-L. In object detection, PVT v2-B4 achieves 46.1 AP on COCO, outperforming RetinaNet and Mask R-CNN. In semantic segmentation, PVT v2-B1/B2/B3/B4 achieve at least 5.3% higher mIoU than PVT v1-Tiny/Small/Medium/Large on ADE20K. The paper also includes ablation studies and computational overhead analyses to validate the effectiveness of each design. The improved baselines are expected to facilitate future research in vision Transformers.The paper "PVT v2: Improved Baselines with Pyramid Vision Transformer" by Wenhai Wang et al. introduces significant improvements to the Pyramid Vision Transformer (PVT v1) by adding three key designs: linear complexity attention layer, overlapping patch embedding, and convolutional feed-forward network. These modifications reduce the computational complexity of PVT v1 to linear, achieving substantial improvements in fundamental vision tasks such as classification, detection, and segmentation. Notably, PVT v2 outperforms recent works like Swin Transformer while having fewer parameters and GFLOPs. The authors report that PVT v2-B5 achieves 83.8% top-1 accuracy on ImageNet, surpassing Swin-B and Twins-SVT-L. In object detection, PVT v2-B4 achieves 46.1 AP on COCO, outperforming RetinaNet and Mask R-CNN. In semantic segmentation, PVT v2-B1/B2/B3/B4 achieve at least 5.3% higher mIoU than PVT v1-Tiny/Small/Medium/Large on ADE20K. The paper also includes ablation studies and computational overhead analyses to validate the effectiveness of each design. The improved baselines are expected to facilitate future research in vision Transformers.