BEiT is a self-supervised vision representation model that uses masked image modeling (MIM) to pre-train vision Transformers. Inspired by BERT, BEiT introduces a pre-training task in which a portion of the image patches is masked and the model predicts the visual tokens corresponding to the masked patches. Each image is first tokenized into discrete visual tokens by a discrete variational autoencoder (dVAE); a subset of image patches is then masked, and the Transformer is trained to recover the visual tokens of the masked patches from the remaining visible patches. This objective encourages the model to learn semantic regions of images without any human annotation.
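To make the objective concrete, below is a minimal PyTorch sketch of a BEiT-style MIM loss. The module names, sizes, and uniform random masking are illustrative assumptions rather than the official implementation; BEiT itself obtains the visual tokens from a pre-trained dVAE and masks roughly 40% of the patches blockwise.

```python
# Minimal sketch of a BEiT-style masked image modeling (MIM) loss (assumed names/sizes).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyMIMPretrainer(nn.Module):
    def __init__(self, num_patches=196, dim=768, vocab_size=8192, depth=12, heads=12):
        super().__init__()
        self.patch_embed = nn.Linear(16 * 16 * 3, dim)           # flattened 16x16 RGB patches
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))   # learnable [MASK] embedding
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.mim_head = nn.Linear(dim, vocab_size)                # predicts visual-token ids

    def forward(self, patches, visual_tokens, mask):
        # patches: (B, N, 768) flattened pixels; visual_tokens: (B, N) dVAE token ids;
        # mask: (B, N) bool, True where a patch is masked out.
        x = self.patch_embed(patches)
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        x = self.encoder(x + self.pos_embed)
        logits = self.mim_head(x)                                 # (B, N, vocab_size)
        # Cross-entropy only over the masked positions, as in the MIM objective.
        return F.cross_entropy(logits[mask], visual_tokens[mask])


# Usage with random stand-in data; a real setup would tokenize the image with the dVAE.
model = ToyMIMPretrainer()
patches = torch.randn(2, 196, 768)
visual_tokens = torch.randint(0, 8192, (2, 196))
mask = torch.rand(2, 196) < 0.4
loss = model(patches, visual_tokens, mask)
loss.backward()
```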
BEiT is pre-trained on the ImageNet-1K dataset and fine-tuned on downstream tasks such as image classification and semantic segmentation. Experimental results show that BEiT outperforms both training from scratch and previous self-supervised methods. It also complements supervised pre-training: further gains are obtained through intermediate fine-tuning with ImageNet labels. Ablation studies confirm the effectiveness of the MIM task and the importance of visual tokens in pre-training.
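As an example of the downstream workflow, here is a hedged sketch of fine-tuning a pre-trained BEiT checkpoint for image classification with the Hugging Face transformers library. The checkpoint id, label count, and random stand-in inputs are assumptions for illustration, not settings from the paper.

```python
# Hedged fine-tuning sketch: load a pre-trained BEiT encoder and train a new
# classification head on a labeled downstream task.
import torch
from transformers import BeitForImageClassification

model = BeitForImageClassification.from_pretrained(
    "microsoft/beit-base-patch16-224-pt22k",   # assumed self-supervised checkpoint id
    num_labels=10,                              # e.g. a small downstream dataset
    ignore_mismatched_sizes=True,               # tolerate a head that differs from the checkpoint
)

pixel_values = torch.randn(2, 3, 224, 224)      # stand-in for image-processor output
labels = torch.tensor([0, 3])
outputs = model(pixel_values=pixel_values, labels=labels)
outputs.loss.backward()                         # one supervised fine-tuning step
```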
The model's self-attention mechanism learns to distinguish semantic regions and object boundaries without manual annotations. BEiT achieves strong performance on image classification and semantic segmentation, with faster convergence and greater stability during fine-tuning. The model also scales well, with larger variants outperforming smaller ones, and remains effective when labeled data is limited. BEiT's self-supervised approach thus offers a promising solution for data-hungry vision Transformers, with results competitive with previous pre-training methods.
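One way to probe this behavior qualitatively is to inspect the self-attention maps of a pre-trained checkpoint, as sketched below; the checkpoint id, the choice of reference patch, and the random stand-in image are assumptions for illustration.

```python
# Hedged sketch of extracting a self-attention map from a pre-trained BEiT model.
import torch
from transformers import BeitModel

model = BeitModel.from_pretrained("microsoft/beit-base-patch16-224-pt22k")
pixel_values = torch.randn(1, 3, 224, 224)       # stand-in for a processed image

with torch.no_grad():
    outputs = model(pixel_values=pixel_values, output_attentions=True)

# Last-layer attention has shape (batch, heads, 197, 197): a [CLS] token plus a
# 14x14 grid of patches. Take the attention from one reference patch to all
# patches, average over heads, and reshape it back onto the patch grid.
last_attn = outputs.attentions[-1]
ref = 1 + 7 * 14 + 7                              # a reference patch near the image center
attention_map = last_attn[0, :, ref, 1:].mean(0).reshape(14, 14)
```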