16 Aug 2021 | Xinlei Chen*, Saining Xie*, Kaiming He
This paper investigates the training of Vision Transformers (ViT) in the context of self-supervised learning. While training recipes for standard convolutional networks are well established, recipes for self-supervised ViT have yet to be built. The authors analyze the effects of fundamental components of self-supervised ViT training, such as batch size, learning rate, and optimizer, and find that instability is a major issue that degrades accuracy. They also observe that unstable training need not cause catastrophic failure; it can instead cause a mild drop in accuracy that is easy to miss unless the run is compared against a more stable counterpart.
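As a rough illustration of the knobs being varied, here is a minimal PyTorch sketch of an optimizer setup with batch-size-dependent learning-rate scaling. The function name and the numeric values are placeholders for illustration, not the paper's exact recipe.

```python
import torch

def build_optimizer(model: torch.nn.Module, batch_size: int,
                    base_lr: float = 1.5e-4, weight_decay: float = 0.1):
    """Illustrative setup for the components the study varies
    (batch size, learning rate, optimizer); values are placeholders."""
    # Linear learning-rate scaling: the effective lr grows with the
    # global batch size relative to a reference batch of 256.
    lr = base_lr * batch_size / 256
    return torch.optim.AdamW(model.parameters(), lr=lr,
                             weight_decay=weight_decay)
```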
The authors benchmark ViT results in MoCo v3 and other self-supervised frameworks, and discuss the current positive evidence, challenges, and open questions. They propose a simple trick to improve stability: freezing the patch projection layer in ViT, which empirically alleviates instability in several scenarios and consistently increases accuracy (sketched below). They also explore scaling up ViT models, including ViT-Large and ViT-Huge, and find that self-supervised ViT can achieve strong results with a contrastive learning framework, compared against masked auto-encoding; this behavior of Transformers differs from the existing trend in NLP. Moreover, as a promising signal, their bigger self-supervised ViT models achieve better accuracy, unlike the ImageNet-supervised ViT in [16], whose accuracy degrades as the model gets bigger.
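The stability trick amounts to leaving the patch projection at its random initialization and excluding it from training. Below is a minimal, self-contained sketch of that idea; the class and function names (PatchEmbed, freeze_patch_projection) are illustrative and not taken from the paper's code.

```python
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Minimal ViT patch projection: a strided conv that maps image
    patches to token embeddings."""
    def __init__(self, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # (B, C, H, W) -> (B, num_patches, embed_dim)
        return self.proj(x).flatten(2).transpose(1, 2)

def freeze_patch_projection(patch_embed: PatchEmbed) -> None:
    # The trick: keep the (randomly initialized) patch projection fixed
    # and train only the rest of the network.
    for p in patch_embed.parameters():
        p.requires_grad = False
```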
The authors also report that their self-supervised ViT models are competitive with the big convolutional ResNets of prior art. On one hand, this comparison shows the potential of ViT, especially considering that it achieves these results with relatively "fewer inductive biases". On the other hand, they suggest there is still room for self-supervised ViT models to improve. As one example, they observe that removing the position embedding in ViT degrades accuracy only by a small margin. This reveals that self-supervised ViT can learn strong representations without the positional inductive bias, but it also implies that the positional information has not been sufficiently exploited.
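For concreteness, the position-embedding ablation can be expressed as a small switch in the token-preparation step. This is a hypothetical sketch under assumed module names (TokenPrep, use_pos_embed), not the authors' implementation.

```python
import torch
import torch.nn as nn

class TokenPrep(nn.Module):
    """Prepends the class token and (optionally) adds position embeddings.
    Setting use_pos_embed=False corresponds to the ablation described above."""
    def __init__(self, num_patches: int, embed_dim: int,
                 use_pos_embed: bool = True):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        self.use_pos_embed = use_pos_embed

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        cls = self.cls_token.expand(tokens.shape[0], -1, -1)
        x = torch.cat([cls, tokens], dim=1)
        # Without position embeddings, the encoder sees an unordered set
        # of patch tokens (plus the class token).
        return x + self.pos_embed if self.use_pos_embed else x
```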
In summary, the authors believe that the evidence, challenges, and open questions in this study are worth knowing if self-supervised Transformers are to close the pre-training gap between vision and language. They hope their data points and experience will be useful for pushing this frontier.