17 Aug 2021 | Ze Liu†*, Yutong Lin†*, Yue Cao*, Han Hu†‡, Yixuan Wei†, Zheng Zhang, Stephen Lin, Baining Guo
The Swin Transformer is a new vision Transformer designed as a general-purpose backbone for computer vision. It addresses the challenges of adapting Transformers from language to vision, notably the large variation in the scale of visual entities and the high resolution of image pixels, by introducing a hierarchical architecture that computes self-attention within shifted windows. Restricting self-attention to non-overlapping local windows while still allowing cross-window connections yields computational complexity that is linear in image size, making the model suitable for tasks ranging from image classification to dense prediction tasks such as object detection and semantic segmentation. The Swin Transformer surpasses previous state-of-the-art models on COCO object detection and ADE20K semantic segmentation, achieving both higher accuracy and better efficiency, and its hierarchical design and shifted-window scheme also prove beneficial for all-MLP architectures. The code is publicly available, and the model's strong performance across vision tasks demonstrates the potential of Transformer-based models as vision backbones and of unified modeling across vision and language.
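To make the shifted-window idea concrete, below is a minimal sketch in PyTorch of the two operations it rests on: partitioning a feature map into non-overlapping windows (so attention cost scales linearly with image size) and cyclically shifting the map between consecutive layers (so windows in layer l+1 straddle the boundaries of windows in layer l). The function name `window_partition` and the tensor shapes are illustrative assumptions, not the authors' exact API.

```python
import torch

def window_partition(x, window_size):
    """Split a feature map (B, H, W, C) into non-overlapping windows.

    Returns (num_windows * B, window_size, window_size, C). Self-attention
    is then computed independently within each window, so the cost grows
    linearly with H * W rather than quadratically.
    """
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)

# Illustrative shapes: an 8x8 feature map with 96 channels and 4x4 windows.
B, H, W, C, window_size = 1, 8, 8, 96, 4
x = torch.randn(B, H, W, C)

# Layer l: regular window partitioning.
regular_windows = window_partition(x, window_size)

# Layer l + 1: cyclically shift the feature map by half a window before
# partitioning, so the new windows cross the previous windows' boundaries
# and introduce connections between them.
shifted = torch.roll(x, shifts=(-window_size // 2, -window_size // 2), dims=(1, 2))
shifted_windows = window_partition(shifted, window_size)

print(regular_windows.shape, shifted_windows.shape)  # (4, 4, 4, 96) each
```

In the full model the shift is undone with a reverse `torch.roll` after attention, and an attention mask keeps tokens that were wrapped around from attending to non-adjacent regions; this sketch shows only the partition-and-shift mechanics behind the linear-complexity claim.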