The paper "PVT v2: Improved Baselines with Pyramid Vision Transformer" by Wenhai Wang et al. introduces significant improvements to the Pyramid Vision Transformer (PVT v1) by adding three key designs: linear complexity attention layer, overlapping patch embedding, and convolutional feed-forward network. These modifications reduce the computational complexity of PVT v1 to linear, achieving substantial improvements in fundamental vision tasks such as classification, detection, and segmentation. Notably, PVT v2 outperforms recent works like Swin Transformer while having fewer parameters and GFLOPs. The authors report that PVT v2-B5 achieves 83.8% top-1 accuracy on ImageNet, surpassing Swin-B and Twins-SVT-L. In object detection, PVT v2-B4 achieves 46.1 AP on COCO, outperforming RetinaNet and Mask R-CNN. In semantic segmentation, PVT v2-B1/B2/B3/B4 achieve at least 5.3% higher mIoU than PVT v1-Tiny/Small/Medium/Large on ADE20K. The paper also includes ablation studies and computational overhead analyses to validate the effectiveness of each design. The improved baselines are expected to facilitate future research in vision Transformers.The paper "PVT v2: Improved Baselines with Pyramid Vision Transformer" by Wenhai Wang et al. introduces significant improvements to the Pyramid Vision Transformer (PVT v1) by adding three key designs: linear complexity attention layer, overlapping patch embedding, and convolutional feed-forward network. These modifications reduce the computational complexity of PVT v1 to linear, achieving substantial improvements in fundamental vision tasks such as classification, detection, and segmentation. Notably, PVT v2 outperforms recent works like Swin Transformer while having fewer parameters and GFLOPs. The authors report that PVT v2-B5 achieves 83.8% top-1 accuracy on ImageNet, surpassing Swin-B and Twins-SVT-L. In object detection, PVT v2-B4 achieves 46.1 AP on COCO, outperforming RetinaNet and Mask R-CNN. In semantic segmentation, PVT v2-B1/B2/B3/B4 achieve at least 5.3% higher mIoU than PVT v1-Tiny/Small/Medium/Large on ADE20K. The paper also includes ablation studies and computational overhead analyses to validate the effectiveness of each design. The improved baselines are expected to facilitate future research in vision Transformers.