This paper introduces PVT v2, an improved version of the Pyramid Vision Transformer (PVT v1) that enhances the original model with three key design changes: (1) a linear-complexity spatial-reduction attention layer, (2) overlapping patch embedding, and (3) a convolutional feed-forward network. These modifications reduce the attention cost of PVT v1 to linear in the input resolution and significantly improve performance on classification, detection, and segmentation, where PVT v2 achieves results comparable to or better than recent models such as Swin Transformer.

PVT v2 is evaluated on image classification, object detection, and semantic segmentation. On ImageNet, PVT v2-B5 achieves 83.8% top-1 accuracy, outperforming Swin-B and Twins-SVT-L. In object detection, PVT v2-B4 achieves 46.1 AP with RetinaNet and 47.5 AP with Mask R-CNN, surpassing PVT v1. In semantic segmentation, PVT v2 outperforms PVT v1 and other models on the ADE20K dataset, while also keeping computational cost lower than comparable backbones. An ablation study shows that each of the three design improvements contributes to the gains, with the linear spatial-reduction attention layer in particular cutting computational overhead. Overall, PVT v2 provides a more efficient and effective baseline for vision transformers.
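
To make the three design changes concrete, the following is a minimal PyTorch sketch of what each component could look like. The class names, pooling size (7x7), and patch-embedding hyperparameters (7x7 kernel, stride 4) are illustrative assumptions for this sketch, not the paper's reference implementation.

```python
import torch
import torch.nn as nn

class OverlapPatchEmbed(nn.Module):
    """Overlapping patch embedding (sketch): a strided conv whose kernel is
    larger than its stride, so neighboring patches share pixels."""
    def __init__(self, in_chans=3, embed_dim=64, patch_size=7, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size,
                              stride=stride, padding=patch_size // 2)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                       # x: (B, C, H, W)
        x = self.proj(x)                        # (B, D, H/stride, W/stride)
        B, D, H, W = x.shape
        x = x.flatten(2).transpose(1, 2)        # (B, N, D) token sequence
        return self.norm(x), (H, W)

class LinearSRAttention(nn.Module):
    """Linear spatial-reduction attention (sketch): keys/values come from a
    fixed-size average-pooled feature map, so attention cost grows linearly
    with the number of query tokens instead of quadratically."""
    def __init__(self, dim, num_heads=1, pool_size=7):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(pool_size)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, hw):                   # x: (B, N, D), N == H*W
        H, W = hw
        B, N, D = x.shape
        kv = x.transpose(1, 2).reshape(B, D, H, W)
        kv = self.pool(kv).flatten(2).transpose(1, 2)   # (B, pool*pool, D)
        out, _ = self.attn(x, kv, kv)
        return out

class ConvFFN(nn.Module):
    """Convolutional feed-forward network (sketch): a 3x3 depth-wise conv
    between the two linear layers injects local spatial information."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim, 3,
                                padding=1, groups=hidden_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x, hw):                   # x: (B, N, D), N == H*W
        H, W = hw
        x = self.fc1(x)
        B, N, C = x.shape
        x = x.transpose(1, 2).reshape(B, C, H, W)
        x = self.dwconv(x)
        x = x.flatten(2).transpose(1, 2)
        return self.fc2(self.act(x))

# Toy check: push a 224x224 image through one stage's worth of modules.
img = torch.randn(1, 3, 224, 224)
tokens, hw = OverlapPatchEmbed()(img)           # (1, 56*56, 64)
tokens = tokens + LinearSRAttention(64)(tokens, hw)
tokens = tokens + ConvFFN(64, 256)(tokens, hw)
print(tokens.shape)                             # torch.Size([1, 3136, 64])
```

The key point of the linear attention variant is visible in the shapes: the query sequence keeps all H*W tokens, while keys and values are pooled to a fixed 7x7 grid, so the attention matrix is N x 49 rather than N x N.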