16 Aug 2021 | Xinlei Chen*, Saining Xie*, Kaiming He
This paper investigates the training of Vision Transformers (ViT) with self-supervised learning frameworks, focusing on the stability issues that arise. The authors study fundamental training components such as batch size, learning rate, and optimizer, and observe that instability is a significant issue affecting accuracy. Unstable training can cause only mild degradation in accuracy, which is easy to miss unless the results are compared against a more stable counterpart. To address this, they propose a simple trick: freezing the patch projection layer (i.e. keeping a fixed, randomly initialized patch embedding), which improves stability and increases accuracy in various scenarios. The study benchmarks ViT results in several self-supervised frameworks, including MoCo v3, SimCLR, BYOL, and SwAV, and discusses the positive evidence, challenges, and open questions in self-supervised ViT training. The findings highlight the importance of stability in self-supervised learning and provide valuable insights for future research.
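To make the "freeze the patch projection" trick concrete, here is a minimal PyTorch sketch. It is not the authors' released code; it assumes a timm-style ViT where the patch projection lives under a `patch_embed` attribute implemented as a strided convolution, and simply turns off gradients for that module before building the optimizer.

```python
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    """Patch projection: split the image into patches and linearly embed each one."""

    def __init__(self, in_chans=3, embed_dim=768, patch_size=16):
        super().__init__()
        # A strided conv is the standard way to implement the linear patch projection.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                      # (B, embed_dim, H/ps, W/ps)
        return x.flatten(2).transpose(1, 2)   # (B, num_patches, embed_dim)


def freeze_patch_projection(vit: nn.Module) -> None:
    """Keep the (randomly initialized) patch projection fixed during training.

    `vit.patch_embed` is an assumed attribute name (timm-style); adapt it to
    whatever module holds the patch projection in the model actually in use.
    """
    for p in vit.patch_embed.parameters():
        p.requires_grad = False


# Hypothetical usage: freeze before constructing the optimizer, and only pass
# the remaining trainable parameters to it.
# freeze_patch_projection(model)
# optimizer = torch.optim.AdamW(
#     [p for p in model.parameters() if p.requires_grad], lr=1.5e-4)
```

The idea, per the paper's summary above, is that the first layer's gradients are a source of training spikes; keeping a fixed random projection removes that source while barely affecting the representation the rest of the transformer can learn.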