CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows

9 Jan 2022 | Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, Baining Guo
CSWin Transformer is an efficient and effective Transformer-based backbone for general-purpose vision tasks. It addresses the tension between global self-attention, which is computationally expensive, and local self-attention, which limits the interaction range of each token. The key innovation is the Cross-Shaped Window (CSWin) self-attention mechanism, which computes self-attention in horizontal and vertical stripes in parallel, the two stripes together forming a cross-shaped window. This keeps computation efficient while preserving strong modeling capability. The stripe width is adjusted with network depth, balancing learning capacity against computation cost.

Additionally, Locally-enhanced Positional Encoding (LePE) handles local positional information better than existing encoding schemes, making CSWin Transformer more effective for downstream tasks such as object detection and segmentation. CSWin Transformer delivers competitive results across vision tasks, achieving high accuracy on ImageNet-1K classification, COCO detection, and ADE20K semantic segmentation, and surpassing the previous state-of-the-art Swin Transformer by significant margins. Pretraining on the larger ImageNet-21K dataset improves performance further, reaching 87.5% Top-1 accuracy on ImageNet-1K and strong segmentation performance on ADE20K.
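To make the mechanism concrete, below is a minimal PyTorch sketch of cross-shaped window attention with a LePE-style term, assuming the usual (B, H*W, C) token layout. This is an illustration, not the authors' implementation: the class name, the use of `F.scaled_dot_product_attention`, and the even head split are choices made here for brevity, and details such as dropout, the final full-attention stage, and the stage-wise stripe-width schedule are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossShapedWindowAttention(nn.Module):
    """Hypothetical sketch of CSWin self-attention.

    Half of the heads attend within horizontal stripes (sw x W); the other
    half attend within vertical stripes (H x sw). Their union gives each
    token a cross-shaped receptive field. LePE is modeled as a depthwise
    3x3 convolution on V added to the attention output inside each stripe.
    """

    def __init__(self, dim, num_heads=8, sw=2):
        super().__init__()
        assert num_heads % 2 == 0, "heads are split between the two orientations"
        self.num_heads = num_heads
        self.sw = sw  # stripe width; the paper enlarges it in deeper stages
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # One depthwise conv per head group (each group carries dim // 2 channels).
        self.lepe_h = nn.Conv2d(dim // 2, dim // 2, 3, padding=1, groups=dim // 2)
        self.lepe_v = nn.Conv2d(dim // 2, dim // 2, 3, padding=1, groups=dim // 2)

    def _stripe_attention(self, qkv, lepe, B, H, W, horizontal):
        heads, d = qkv.shape[2], qkv.shape[4]
        h_sp, w_sp = (self.sw, W) if horizontal else (H, self.sw)
        # Partition the (H, W) token grid into non-overlapping stripes.
        qkv = qkv.reshape(3, B, heads, H // h_sp, h_sp, W // w_sp, w_sp, d)
        qkv = qkv.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(3, -1, heads, h_sp * w_sp, d)
        q, k, v = qkv[0], qkv[1], qkv[2]
        # LePE-style term: depthwise conv applied to V within each stripe.
        v2d = v.permute(0, 1, 3, 2).reshape(-1, heads * d, h_sp, w_sp)
        pos = lepe(v2d).reshape(-1, heads, d, h_sp * w_sp).permute(0, 1, 3, 2)
        out = F.scaled_dot_product_attention(q, k, v) + pos
        # Undo the stripe partition back to (B, heads, H * W, d).
        out = out.reshape(B, H // h_sp, W // w_sp, heads, h_sp, w_sp, d)
        return out.permute(0, 3, 1, 4, 2, 5, 6).reshape(B, heads, H * W, d)

    def forward(self, x, H, W):
        # x: (B, H * W, C); H and W are assumed divisible by the stripe width.
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        qkv = qkv.permute(2, 0, 3, 1, 4)  # (3, B, heads, N, head_dim)
        half = self.num_heads // 2
        out = torch.cat([
            self._stripe_attention(qkv[:, :, :half], self.lepe_h, B, H, W, True),
            self._stripe_attention(qkv[:, :, half:], self.lepe_v, B, H, W, False),
        ], dim=1)
        return self.proj(out.transpose(1, 2).reshape(B, N, C))


# Example with hypothetical sizes: a 16x16 token map, 64-dim embedding, stripe width 2.
attn = CrossShapedWindowAttention(dim=64, num_heads=8, sw=2)
tokens = torch.randn(1, 16 * 16, 64)
print(attn(tokens, H=16, W=16).shape)  # torch.Size([1, 256, 64])
```

Splitting the heads between the two orientations, rather than running them sequentially, keeps roughly the cost of a single local-attention layer while letting stacked blocks enlarge the effective receptive field quickly.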