9 Jan 2022 | Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, Baining Guo
CSWin Transformer is a general-purpose vision transformer backbone that handles a variety of vision tasks efficiently and effectively. Its key innovation is the Cross-Shaped Window (CSWin) self-attention mechanism, which computes self-attention in parallel within horizontal and vertical stripes that together form a cross-shaped window. This mechanism provides strong modeling capability while keeping computational cost low. The stripe width grows with network depth: shallow layers use narrow stripes and deeper layers use wider ones, enabling efficient modeling of long-range interactions.
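To make the mechanism concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the channels are split in half, one half attends within horizontal stripes of width `sw` and the other within vertical stripes, and the two results are concatenated. The names `stripe_attention`, `CrossShapedWindowAttention`, and the single shared `proj_in` projection are illustrative assumptions; separate Q/K/V projections, LePE, and other details from the paper are omitted.

```python
import torch
import torch.nn as nn


def stripe_attention(x, sw, num_heads, horizontal=True):
    """Self-attention computed independently inside each stripe.

    x: (B, H, W, C) feature map. Horizontal stripes have shape (sw, W);
    vertical stripes are handled by transposing the spatial axes.
    Requires H (or W for the vertical branch) to be divisible by sw.
    """
    B, H, W, C = x.shape
    if not horizontal:                       # vertical stripes: swap H and W
        x = x.transpose(1, 2)
        H, W = W, H
    head_dim = C // num_heads
    # partition into (H // sw) stripes, each a sequence of sw * W tokens
    x = x.reshape(B, H // sw, sw, W, C).reshape(B * (H // sw), sw * W, C)
    # simplified: reuse the same tensor for Q, K, V (real code projects them separately)
    qkv = x.reshape(-1, sw * W, num_heads, head_dim).transpose(1, 2)
    attn = (qkv @ qkv.transpose(-2, -1)) / head_dim ** 0.5
    out = (attn.softmax(dim=-1) @ qkv).transpose(1, 2).reshape(B * (H // sw), sw * W, C)
    out = out.reshape(B, H // sw, sw, W, C).reshape(B, H, W, C)
    if not horizontal:
        out = out.transpose(1, 2)
    return out


class CrossShapedWindowAttention(nn.Module):
    """Minimal sketch: half of the channels attend within horizontal stripes,
    the other half within vertical stripes, forming a cross-shaped window."""

    def __init__(self, dim, num_heads=4, stripe_width=2):
        super().__init__()
        self.proj_in = nn.Linear(dim, dim)     # placeholder for the QKV projection
        self.proj_out = nn.Linear(dim, dim)
        self.num_heads = num_heads
        self.sw = stripe_width                 # grows with network depth in the paper

    def forward(self, x):                      # x: (B, H, W, C)
        x = self.proj_in(x)
        c = x.shape[-1] // 2
        h_branch = stripe_attention(x[..., :c], self.sw, self.num_heads // 2, horizontal=True)
        v_branch = stripe_attention(x[..., c:], self.sw, self.num_heads // 2, horizontal=False)
        return self.proj_out(torch.cat([h_branch, v_branch], dim=-1))


# example: 56x56 feature map with 64 channels, stripe width 2
x = torch.randn(1, 56, 56, 64)
print(CrossShapedWindowAttention(64, num_heads=4, stripe_width=2)(x).shape)  # (1, 56, 56, 64)
```

Because each stripe spans the full height or width of the feature map, even narrow stripes give every token a long reach along one axis, which is why widening the stripes with depth suffices for global interaction.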
Additionally, the paper introduces Locally-enhanced Positional Encoding (LePE), which better captures local positional information and handles varying input resolutions well. CSWin Transformer is evaluated on ImageNet-1K classification, COCO detection, and ADE20K segmentation. Without extra training data, it achieves 85.4% Top-1 accuracy on ImageNet-1K, 53.9 box AP and 46.4 mask AP on COCO, and 52.2 mIoU on ADE20K. With ImageNet-21K pretraining, it reaches 87.5% Top-1 accuracy on ImageNet-1K and 55.7 mIoU on ADE20K. Under similar FLOPs, CSWin Transformer outperforms the previous state-of-the-art Swin Transformer across these benchmarks, with particularly large gains on object detection and semantic segmentation, making it an efficient, scalable, and effective backbone for a wide range of vision tasks.
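The LePE idea can be sketched as follows: a depthwise convolution over the value tensor V supplies a local positional signal that is added to the attention output, and because this signal is computed from V itself rather than from a fixed-size table, it transfers to input resolutions not seen during training. The class name `LePEAttention` and the 3x3 kernel size below are assumptions for illustration; this is a simplified sketch, not the reference implementation.

```python
import torch
import torch.nn as nn


class LePEAttention(nn.Module):
    """Sketch of attention with Locally-enhanced Positional Encoding (LePE):
    a depthwise convolution over V supplies local positional information and
    is added to the attention output."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.num_heads = num_heads
        # depthwise conv acting as the positional-encoding operator (3x3 assumed)
        self.lepe = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, q, k, v, H, W):
        # q, k, v: (B, N, C) token sequences with N = H * W
        B, N, C = v.shape
        head_dim = C // self.num_heads

        def split_heads(t):
            return t.reshape(B, N, self.num_heads, head_dim).transpose(1, 2)

        # LePE term: depthwise conv over V laid out as a 2D feature map
        v_map = v.transpose(1, 2).reshape(B, C, H, W)
        pos = self.lepe(v_map).reshape(B, C, N).transpose(1, 2)

        # standard scaled dot-product attention
        attn = (split_heads(q) @ split_heads(k).transpose(-2, -1)) / head_dim ** 0.5
        out = (attn.softmax(dim=-1) @ split_heads(v)).transpose(1, 2).reshape(B, N, C)
        return out + pos


# example: 196 tokens (14x14) with 64 channels
q = k = v = torch.randn(2, 196, 64)
print(LePEAttention(64)(q, k, v, H=14, W=14).shape)  # (2, 196, 64)
```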