This paper revisits the design of spatial attention in Vision Transformers and proposes two new architectures, Twins-PCPVT and Twins-SVT, which achieve excellent performance across a range of visual tasks. The authors find that the global sub-sampled attention in the Pyramid Vision Transformer (PVT) is highly effective and, when paired with appropriate positional encodings, can match or even surpass state-of-the-art models such as the Swin Transformer. They also introduce a new attention mechanism, Spatially Separable Self-Attention (SSSA), which combines locally-grouped self-attention (LSA) within non-overlapping windows and global sub-sampled attention (GSA) across them. The mechanism is efficient and easy to implement, involving only matrix multiplications. Benchmarked on image classification, semantic segmentation, and object detection, the proposed architectures deliver superior performance at lower computational cost than other state-of-the-art models. Their simplicity and strong performance suggest they could serve as powerful backbones for many vision tasks.
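To make the LSA/GSA split concrete, below is a minimal PyTorch sketch of the two components of SSSA. It is an illustration under stated assumptions, not the authors' implementation: the window size and sub-sampling ratio are hypothetical, torch's built-in `nn.MultiheadAttention` stands in for the paper's custom attention, and the normalization layers and feed-forward blocks of the full architecture are omitted.

```python
# Minimal sketch of Spatially Separable Self-Attention (SSSA), assuming
# square feature maps whose sides divide evenly by the window size and
# sub-sample ratio. Module names and hyperparameters are illustrative.
import torch
import torch.nn as nn


class LSA(nn.Module):
    """Locally-grouped self-attention: full attention inside non-overlapping windows."""
    def __init__(self, dim, num_heads, window):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):  # x: (B, H, W, C), H and W divisible by window
        B, H, W, C = x.shape
        w = self.window
        # Partition the map into (H//w * W//w) windows of w*w tokens each.
        x = x.view(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(B * (H // w) * (W // w), w * w, C)
        x, _ = self.attn(x, x, x)  # attention restricted to each window
        # Undo the window partition back to the (B, H, W, C) layout.
        x = x.view(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5)
        return x.reshape(B, H, W, C)


class GSA(nn.Module):
    """Global sub-sampled attention: every query attends to a strided summary of the map."""
    def __init__(self, dim, num_heads, sr_ratio):
        super().__init__()
        # Strided convolution produces one representative key/value
        # per sr_ratio x sr_ratio patch.
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):  # x: (B, H, W, C)
        B, H, W, C = x.shape
        q = x.reshape(B, H * W, C)
        kv = self.sr(x.permute(0, 3, 1, 2)).flatten(2).transpose(1, 2)
        out, _ = self.attn(q, kv, kv)
        return out.view(B, H, W, C)


if __name__ == "__main__":
    x = torch.randn(2, 16, 16, 64)                  # toy (B, H, W, C) feature map
    x = x + LSA(64, num_heads=4, window=4)(x)       # local exchange within windows
    x = x + GSA(64, num_heads=4, sr_ratio=4)(x)     # global exchange across windows
    print(x.shape)                                  # torch.Size([2, 16, 16, 64])
```

The usage example mirrors the pairing described above: LSA keeps attention cheap by restricting it to small windows, and the following GSA step lets information flow between windows through a sub-sampled set of keys and values, so both steps reduce to plain matrix multiplications.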