Swin Transformer V2: Scaling Up Capacity and Resolution

2022 | Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo
Swin Transformer V2 is a large-scale vision model that significantly improves performance on a range of vision tasks. The paper introduces three key techniques to address the difficulties of training and applying large vision models: 1) a residual post-norm method combined with cosine attention to improve training stability; 2) a log-spaced continuous position bias method to effectively transfer models pre-trained at low resolution to downstream tasks with high-resolution inputs; and 3) a self-supervised pre-training method, SimMIM, to reduce the need for large amounts of labeled data.

These techniques enable the training of a 3 billion-parameter Swin Transformer V2 model that can handle images of up to 1,536 × 1,536 resolution. The model achieves state-of-the-art results on four representative vision tasks: ImageNet-V2 image classification, COCO object detection, ADE20K semantic segmentation, and Kinetics-400 video action classification. Training is far more efficient than for Google's billion-parameter visual models, using 40 times less labeled data and 40 times less training time. The model was trained on Nvidia A100-40G GPUs and surpassed previous records on several of these benchmarks.

The paper also discusses implementation details for saving GPU memory, including the Zero-Redundancy Optimizer (ZeRO), activation checkpointing, and sequential self-attention computation. The architecture keeps the name of the original Swin Transformer, with modifications aimed at scaling capacity and resolution, and the results across tasks demonstrate the effectiveness of the proposed techniques for scaling up vision models.
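To make the first two techniques concrete, below is a minimal PyTorch sketch of scaled cosine attention with a log-spaced continuous position bias, written from the description above rather than from the authors' released code; the names `CosineWindowAttention` and `log_spaced_relative_coords`, the MLP hidden size of 512, and the temperature initialization are illustrative assumptions.

```python
# Minimal sketch (not the authors' implementation) of two Swin V2 ideas:
# (1) scaled cosine attention and (2) log-spaced continuous position bias (CPB).
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


def log_spaced_relative_coords(window_size: int) -> torch.Tensor:
    """Pairwise relative coordinates of a square window, mapped to log space.

    Swin V2 replaces V1's learned per-offset bias table with a small MLP applied
    to log-spaced relative coordinates, which extrapolates better when the
    window size changes for high-resolution fine-tuning.
    """
    coords = torch.arange(window_size)
    grid = torch.stack(torch.meshgrid(coords, coords, indexing="ij"), dim=-1)
    flat = grid.reshape(-1, 2).float()                       # (N, 2), N = window_size**2
    rel = flat[:, None, :] - flat[None, :, :]                # (N, N, 2)
    # sign(x) * log2(1 + |x|), normalized to roughly [-1, 1]
    rel = torch.sign(rel) * torch.log2(1.0 + rel.abs())
    return rel / math.log2(window_size)


class CosineWindowAttention(nn.Module):
    """Window attention using cosine similarity instead of scaled dot products.

    Cosine similarity is bounded, which the paper reports tames the exploding
    attention logits observed when scaling model capacity.
    """

    def __init__(self, dim: int, num_heads: int, window_size: int):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # learnable per-head temperature, stored in log space
        self.logit_scale = nn.Parameter(torch.log(10.0 * torch.ones(num_heads, 1, 1)))
        # small MLP producing one bias value per head from log-spaced coordinates
        self.cpb_mlp = nn.Sequential(
            nn.Linear(2, 512), nn.ReLU(inplace=True), nn.Linear(512, num_heads)
        )
        self.register_buffer("rel_coords", log_spaced_relative_coords(window_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_windows * batch, N, dim) with N = window_size**2
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                 # each (B, heads, N, head_dim)

        # cosine attention: normalize q and k, then scale by a learned temperature
        attn = F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(-2, -1)
        attn = attn * torch.clamp(self.logit_scale, max=math.log(100.0)).exp()

        # continuous position bias computed from the log-spaced coordinates
        bias = self.cpb_mlp(self.rel_coords)                 # (N, N, heads)
        attn = attn + bias.permute(2, 0, 1).unsqueeze(0)     # broadcast over batch

        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


# Usage: one 8x8 window of 96-dim tokens, 3 attention heads.
if __name__ == "__main__":
    attn = CosineWindowAttention(dim=96, num_heads=3, window_size=8)
    tokens = torch.randn(2, 64, 96)
    print(attn(tokens).shape)  # torch.Size([2, 64, 96])
```

The key design choice is that cosine similarity is bounded in [-1, 1], so attention logits cannot blow up as the model grows, while the bias MLP operating on log-spaced coordinates degrades gracefully when the window size (and thus input resolution) changes between pre-training and fine-tuning.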