Twins: Revisiting the Design of Spatial Attention in Vision Transformers

30 Sep 2021 | Xiangxiang Chu, Zhi Tian, Yuqing Wang, Bo Zhang, Haibing Ren, Xiaolin Wei, Huaxia Xia, Chunhua Shen
This paper presents two efficient vision transformer architectures, Twins-PCPVT and Twins-SVT, which achieve strong performance on a range of visual tasks. The authors revisit the design of spatial attention in vision transformers and propose a simple yet effective alternative. Twins-PCPVT builds on PVT and CPVT, using conditional positional encodings to improve performance. Twins-SVT introduces spatially separable self-attention (SSSA), which combines locally-grouped self-attention (LSA) within non-overlapping windows and global sub-sampled attention (GSA) over a sub-sampled key/value map; this combination allows efficient computation while still modeling global context. Both architectures involve only matrix multiplications, which are well optimized in modern deep learning frameworks. Extensive experiments show that both architectures outperform state-of-the-art vision transformers in performance and efficiency: Twins-PCPVT achieves performance comparable to Swin on image classification, while Twins-SVT achieves state-of-the-art results on semantic segmentation. The architectures are also efficient and easy to implement, making them suitable for deployment on a variety of platforms. The code is available at https://git.io/Twins.
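To make the SSSA idea concrete, below is a minimal PyTorch sketch of one LSA/GSA pair: locally-grouped self-attention restricted to non-overlapping windows, followed by global sub-sampled attention whose keys and values come from a strided convolution over the feature map. Class names, the window size, and the sub-sampling ratio are illustrative assumptions, not the authors' reference implementation (see https://git.io/Twins for that).

```python
# Minimal sketch of spatially separable self-attention (SSSA), assuming a
# (B, H, W, C) feature map and H, W divisible by the window size.
import torch
import torch.nn as nn


class LSA(nn.Module):
    """Locally-grouped self-attention: attention inside k x k windows."""
    def __init__(self, dim, num_heads=4, window=7):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                     # x: (B, H, W, C)
        B, H, W, C = x.shape
        k = self.window
        # Partition the map into (H//k * W//k) windows of k*k tokens each.
        x = x.view(B, H // k, k, W // k, k, C).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(B * (H // k) * (W // k), k * k, C)
        x, _ = self.attn(x, x, x)             # attention restricted to each window
        x = x.view(B, H // k, W // k, k, k, C).permute(0, 1, 3, 2, 4, 5)
        return x.reshape(B, H, W, C)


class GSA(nn.Module):
    """Global sub-sampled attention: all queries attend to a pooled key/value map."""
    def __init__(self, dim, num_heads=4, sr_ratio=7):
        super().__init__()
        # Strided conv produces one "summary" token per sr_ratio x sr_ratio region.
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                     # x: (B, H, W, C)
        B, H, W, C = x.shape
        q = x.reshape(B, H * W, C)
        kv = self.sr(x.permute(0, 3, 1, 2)).flatten(2).transpose(1, 2)
        out, _ = self.attn(q, kv, kv)
        return out.view(B, H, W, C)


if __name__ == "__main__":
    x = torch.randn(2, 28, 28, 64)            # toy feature map
    y = GSA(64)(LSA(64)(x))                   # one SSSA pair: LSA then GSA
    print(y.shape)                            # torch.Size([2, 28, 28, 64])
```

In the full Twins-SVT blocks these attention layers are wrapped with the usual layer norms and MLPs; the sketch only illustrates why the cost stays low, since LSA is quadratic only within each window and GSA attends to a heavily sub-sampled set of keys and values.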