Swin Transformer V2: Scaling Up Capacity and Resolution

2022 | Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo
Swin Transformer V2 is a large-scale vision model that significantly improves performance on a range of vision tasks. The paper introduces three key techniques to address the difficulties of training and applying large vision models: 1) a residual post-norm method combined with cosine attention to improve training stability; 2) a log-spaced continuous position bias method to effectively transfer models pre-trained at low resolution to downstream tasks with high-resolution inputs; and 3) a self-supervised pre-training method, SimMIM, to reduce the need for large amounts of labeled data.

These techniques enable the training of a 3 billion-parameter Swin Transformer V2 model that can handle images of up to 1,536 × 1,536 resolution. The model achieves state-of-the-art results on four representative vision tasks: ImageNet-V2 image classification, COCO object detection, ADE20K semantic segmentation, and Kinetics-400 video action classification. Training is far more efficient than for Google's billion-parameter visual models, using 40 times less labeled data and 40 times less training time. The model was trained on Nvidia A100-40G GPUs and surpassed previous records on several of these benchmarks.

The paper also discusses implementation details for saving GPU memory, including the Zero-Redundancy Optimizer (ZeRO), activation checkpointing, and sequential self-attention computation. The architecture keeps the name of the original Swin Transformer, with modifications aimed at scaling capacity and resolution, and the results across tasks demonstrate the effectiveness of the proposed techniques for scaling up vision models.
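To make the first two techniques concrete, below is a minimal PyTorch sketch of scaled cosine attention with a log-spaced continuous position bias, written from the description above rather than from the authors' released code; the names `CosineWindowAttention` and `log_spaced_relative_coords`, the MLP hidden size of 512, and the temperature initialization are illustrative assumptions.

```python
# Minimal sketch (not the authors' implementation) of two Swin V2 ideas:
# (1) scaled cosine attention and (2) log-spaced continuous position bias (CPB).
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


def log_spaced_relative_coords(window_size: int) -> torch.Tensor:
    """Pairwise relative coordinates of a square window, mapped to log space.

    Swin V2 replaces V1's learned per-offset bias table with a small MLP applied
    to log-spaced relative coordinates, which extrapolates better when the
    window size changes for high-resolution fine-tuning.
    """
    coords = torch.arange(window_size)
    grid = torch.stack(torch.meshgrid(coords, coords, indexing="ij"), dim=-1)
    flat = grid.reshape(-1, 2).float()                       # (N, 2), N = window_size**2
    rel = flat[:, None, :] - flat[None, :, :]                # (N, N, 2)
    # sign(x) * log2(1 + |x|), normalized to roughly [-1, 1]
    rel = torch.sign(rel) * torch.log2(1.0 + rel.abs())
    return rel / math.log2(window_size)


class CosineWindowAttention(nn.Module):
    """Window attention using cosine similarity instead of scaled dot products.

    Cosine similarity is bounded, which the paper reports tames the exploding
    attention logits observed when scaling model capacity.
    """

    def __init__(self, dim: int, num_heads: int, window_size: int):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # learnable per-head temperature, stored in log space
        self.logit_scale = nn.Parameter(torch.log(10.0 * torch.ones(num_heads, 1, 1)))
        # small MLP producing one bias value per head from log-spaced coordinates
        self.cpb_mlp = nn.Sequential(
            nn.Linear(2, 512), nn.ReLU(inplace=True), nn.Linear(512, num_heads)
        )
        self.register_buffer("rel_coords", log_spaced_relative_coords(window_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_windows * batch, N, dim) with N = window_size**2
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                 # each (B, heads, N, head_dim)

        # cosine attention: normalize q and k, then scale by a learned temperature
        attn = F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(-2, -1)
        attn = attn * torch.clamp(self.logit_scale, max=math.log(100.0)).exp()

        # continuous position bias computed from the log-spaced coordinates
        bias = self.cpb_mlp(self.rel_coords)                 # (N, N, heads)
        attn = attn + bias.permute(2, 0, 1).unsqueeze(0)     # broadcast over batch

        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


# Usage: one 8x8 window of 96-dim tokens, 3 attention heads.
if __name__ == "__main__":
    attn = CosineWindowAttention(dim=96, num_heads=3, window_size=8)
    tokens = torch.randn(2, 64, 96)
    print(attn(tokens).shape)  # torch.Size([2, 64, 96])
```

The key design choice is that cosine similarity is bounded in [-1, 1], so attention logits cannot blow up as the model grows, while the bias MLP operating on log-spaced coordinates degrades gracefully when the window size (and thus input resolution) changes between pre-training and fine-tuning.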