Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

17 Aug 2021 | Ze Liu†*, Yutong Lin†*, Yue Cao*, Han Hu†‡, Yixuan Wei†, Zheng Zhang, Stephen Lin, Baining Guo
This paper introduces the Swin Transformer, a novel vision Transformer designed to serve as a general-purpose backbone for computer vision tasks. The Swin Transformer addresses the challenges of adapting Transformers from natural language processing (NLP) to computer vision by introducing a hierarchical architecture and a shifted windowing scheme. The hierarchical design allows the model to capture features at various scales, while the shifted windowing scheme limits self-attention computation to non-overlapping local windows, reducing computational complexity to linear with respect to image size. This approach enables efficient and flexible modeling of visual entities at varying scales, including high-resolution inputs. The Swin Transformer achieves state-of-the-art performance on several vision tasks, including image classification (87.3% top-1 accuracy on ImageNet-1K), object detection (58.7 box AP and 51.1 mask AP on COCO test-dev), and semantic segmentation (53.5 mIoU on ADE20K val). The paper also discusses the benefits of the shifted window approach for all-MLP architectures and provides detailed experimental results and ablation studies to support the design choices. The code and models are publicly available at https://github.com/microsoft/Swin-Transformer.
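To make the shifted windowing scheme concrete, here is a minimal PyTorch sketch of window partitioning with a cyclic shift, following the torch.roll trick described in the paper. The function names (`window_partition`, `shift_and_partition`) and the tensor shapes are illustrative, not the official API; in the full model an attention mask is additionally applied so that tokens wrapped around by the cyclic shift do not attend across the original image boundary.

```python
import torch

def window_partition(x, window_size):
    """Split a feature map into non-overlapping windows.

    x: (B, H, W, C) feature map; H and W are assumed divisible by window_size.
    Returns: (num_windows * B, window_size, window_size, C)
    """
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)

def shift_and_partition(x, window_size, shift_size):
    """Cyclically shift the feature map toward the top-left, then partition.

    Alternating regular and shifted windows lets successive blocks attend
    across the previous block's window boundaries while each attention call
    still operates only within local windows (linear cost in image size).
    """
    shifted = torch.roll(x, shifts=(-shift_size, -shift_size), dims=(1, 2))
    return window_partition(shifted, window_size)

# Example: a 56x56 feature map with 7x7 windows and a shift of window_size // 2.
x = torch.randn(1, 56, 56, 96)
windows = shift_and_partition(x, window_size=7, shift_size=3)
print(windows.shape)  # torch.Size([64, 7, 7, 96])
```

Because the shift is cyclic, the number and size of windows stay the same as in the regular partition, so the shifted block adds no extra attention computation.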