BEiT: BERT Pre-Training of Image Transformers
**Authors:** Hangbo Bao, Li Dong, Songhao Piao, Furu Wei
**Institution:** Harbin Institute of Technology, Microsoft Research
**Abstract:**
BEiT (Bidirectional Encoder representation from Image Transformers) is a self-supervised vision representation model inspired by BERT. It introduces a masked image modeling (MIM) task to pre-train vision Transformers. During pre-training, each image has two views: image patches and discrete visual tokens produced by an image tokenizer. Some image patches are randomly masked and fed into the backbone Transformer, and the pre-training objective is to recover the original visual tokens of the corrupted patches. After pre-training, BEiT is fine-tuned on downstream tasks such as image classification and semantic segmentation. Experimental results show that BEiT is competitive with previous pre-training methods, outperforming both from-scratch training and strong self-supervised baselines. Ablation studies confirm the contribution of each proposed technique, and BEiT is shown to distinguish semantic regions and object boundaries without any human annotation.
**Key Contributions:**
- Introduction of a masked image modeling (MIM) task to pre-train vision Transformers in a self-supervised manner.
- Extensive fine-tuning experiments on downstream tasks.
- Demonstration of BEiT's ability to learn semantic regions and object boundaries.
- Ablation studies showing the importance of the proposed techniques.
**Methods:**
- **Image Representations:** Each image has two views: image patches (flattened and linearly projected as inputs to the backbone) and discrete visual tokens produced by a learned image tokenizer (a discrete VAE), which serve as the prediction targets during pre-training.
- **Backbone Network:** A standard Transformer is used as the backbone network.
- **Pre-Training BEiT:** In the MIM task, a portion of image patches is masked and the model predicts the visual tokens of the masked positions from the corrupted image (a minimal PyTorch sketch follows this list).
- **From the Perspective of Variational Autoencoder:** BEiT pre-training can be interpreted as the second stage of a variational-autoencoder-style training procedure (the corresponding objective is sketched after this list).
- **Pre-Training Setup:** BEiT is pre-trained on ImageNet-1K with random initialization and no labels.
- **Fine-Tuning BEiT:** Task layers are appended to the pre-trained Transformer and the whole model is fine-tuned on downstream tasks (a classification fine-tuning sketch follows this list).
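The bullets above give the pipeline at a high level; the following is a minimal PyTorch sketch of how the pieces could fit together, not the authors' implementation. The ViT-Base-style sizes (16×16 patches at 224×224 resolution, 768-dim, 12 layers, 12 heads) and the 8192-token visual vocabulary follow the paper, while the class and variable names are hypothetical, the dVAE tokenizer is replaced by random token ids so the example runs standalone, and random per-patch masking stands in for the paper's blockwise masking (sketched under Experiments).

```python
# Minimal sketch of BEiT-style masked image modeling (MIM) pre-training.
# Not the authors' code: the discrete visual tokenizer (a dVAE in the paper)
# is replaced by random token ids purely so the example runs end to end, and
# ViT details such as pre-LN and GELU are omitted for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F

IMG, PATCH, DIM, VOCAB = 224, 16, 768, 8192        # ViT-Base-like sizes from the paper
NUM_PATCHES = (IMG // PATCH) ** 2                   # 14 x 14 = 196 patches


class BEiTPretrainSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Linear projection of flattened 16x16 RGB patches to patch embeddings.
        self.patch_embed = nn.Conv2d(3, DIM, kernel_size=PATCH, stride=PATCH)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, DIM))
        self.mask_token = nn.Parameter(torch.zeros(1, 1, DIM))     # learnable [MASK] embedding
        self.pos_embed = nn.Parameter(torch.zeros(1, NUM_PATCHES + 1, DIM))
        layer = nn.TransformerEncoderLayer(DIM, nhead=12, dim_feedforward=4 * DIM,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=12)
        # Softmax classifier over the visual-token vocabulary for masked positions.
        self.mim_head = nn.Linear(DIM, VOCAB)

    def forward(self, images, visual_tokens, mask):
        # images: (B, 3, 224, 224); visual_tokens: (B, 196) token ids from the tokenizer;
        # mask: (B, 196) boolean, True where a patch is masked.
        x = self.patch_embed(images).flatten(2).transpose(1, 2)    # (B, 196, DIM)
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        x = torch.cat([self.cls_token.expand(x.size(0), -1, -1), x], dim=1)
        x = self.encoder(x + self.pos_embed)
        logits = self.mim_head(x[:, 1:])                           # drop [CLS]
        # Cross-entropy only on masked positions: recover the original visual tokens.
        return F.cross_entropy(logits[mask], visual_tokens[mask])


if __name__ == "__main__":
    model = BEiTPretrainSketch()
    imgs = torch.randn(2, 3, IMG, IMG)
    tokens = torch.randint(0, VOCAB, (2, NUM_PATCHES))             # stand-in for dVAE tokenizer output
    mask = torch.rand(2, NUM_PATCHES) < 0.4                        # ~40% of patches masked
    print(model(imgs, tokens, mask).item())
```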
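The variational-autoencoder view can be summarized roughly as follows (a paraphrase of the paper's derivation, not a full restatement): $q_\phi(z \mid x)$ is the image tokenizer, $p_\psi(x \mid z)$ the dVAE decoder that reconstructs the image from visual tokens, and $p_\theta(z \mid \tilde{x})$ the BEiT encoder that recovers visual tokens from the masked image $\tilde{x}$. Approximating the tokenizer posterior with a one-point distribution turns the evidence lower bound on $\log p(x_i \mid \tilde{x}_i)$ into a two-stage objective:

```latex
% Two-stage objective, paraphrased from the paper's VAE view:
\sum_{(x_i,\tilde{x}_i)\in\mathcal{D}}
\Big(
  \underbrace{\mathbb{E}_{z_i \sim q_\phi(z \mid x_i)}\big[\log p_\psi(x_i \mid z_i)\big]}_{\text{stage 1: visual token reconstruction (dVAE)}}
  \;+\;
  \underbrace{\log p_\theta(\hat{z}_i \mid \tilde{x}_i)}_{\text{stage 2: masked image modeling}}
\Big),
\qquad \hat{z}_i = \arg\max_{z} q_\phi(z \mid x_i)
```

Stage one trains the tokenizer and decoder (the paper uses a publicly available tokenizer rather than training its own); stage two keeps them fixed and maximizes the second term, which is exactly the masked visual-token prediction loss used in MIM.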
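For image classification, the paper average-pools the final patch representations and feeds them to a softmax classifier. Below is a minimal sketch of appending such a task layer, continuing the `BEiTPretrainSketch` example above (names and structure again hypothetical):

```python
# Fine-tuning sketch: reuse the pre-trained backbone, append a task layer,
# and train end to end. No masking is applied at fine-tuning time.
import torch
import torch.nn as nn


class BEiTClassifierSketch(nn.Module):
    def __init__(self, pretrained: "BEiTPretrainSketch", num_classes: int):
        super().__init__()
        self.backbone = pretrained
        self.head = nn.Linear(DIM, num_classes)    # task layer appended after pre-training

    def forward(self, images):
        b = self.backbone
        x = b.patch_embed(images).flatten(2).transpose(1, 2)
        x = torch.cat([b.cls_token.expand(x.size(0), -1, -1), x], dim=1)
        x = b.encoder(x + b.pos_embed)
        return self.head(x[:, 1:].mean(dim=1))     # average-pool the patch tokens


# Usage: logits = BEiTClassifierSketch(model, num_classes=1000)(imgs)
```

For semantic segmentation the pre-trained Transformer is used as the backbone of a segmentation framework with a task-specific decoder; the same append-and-fine-tune recipe applies.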
**Experiments:**
- **Image Classification:** BEiT outperforms random initialization, supervised pre-training, and previous self-supervised methods.
- **Semantic Segmentation:** BEiT achieves better performance than supervised pre-training on ADE20K.
- **Ablation Studies:** Blockwise masking and predicting visual tokens (rather than raw pixels) are both important for BEiT's effectiveness (a sketch of blockwise masking follows this list).
- **Analysis of Self-Attention Map:** BEiT can separate objects using self-attention mechanisms without manual annotations.
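The blockwise masking strategy referenced in the ablation can be approximated as below. This is a sketch of the idea rather than the paper's exact algorithm: rectangular blocks of at least 16 patches with a random aspect ratio are masked repeatedly until roughly 40% of the 14×14 patch grid is covered; the specific sampling bounds here are assumptions.

```python
# Sketch of blockwise masking on a 14x14 patch grid (approximation, not the
# paper's algorithm verbatim). Rectangular blocks are masked until the target
# ratio of patches is reached.
import math
import random

import torch


def blockwise_mask(grid=14, mask_ratio=0.4, min_block=16, max_aspect=1 / 0.3):
    num_patches = grid * grid
    target = int(mask_ratio * num_patches)
    mask = torch.zeros(grid, grid, dtype=torch.bool)
    while int(mask.sum()) < target:
        s = random.randint(min_block, max(min_block, target - int(mask.sum())))  # block area in patches
        r = random.uniform(1 / max_aspect, max_aspect)                            # block aspect ratio
        h = min(grid, max(1, round(math.sqrt(s * r))))
        w = min(grid, max(1, round(math.sqrt(s / r))))
        top = random.randint(0, grid - h)
        left = random.randint(0, grid - w)
        mask[top:top + h, left:left + w] = True                                   # mask the whole block
    return mask.flatten()                                                          # (196,) boolean


# Usage with the pre-training sketch in Methods:
# mask = torch.stack([blockwise_mask() for _ in range(batch_size)])
```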
**Related Work:**
- Self-supervised visual representation learning methods, including contrastive learning and clustering.
- Self-supervised vision Transformers, e.g., iGPT (autoregressive and BERT-style prediction over down-sampled pixels) and the masked patch prediction explored as a preliminary self-supervised task in ViT.
**Conclusion:**
BEiT introduces a self-supervised pre-training framework for vision Transformers, achieving strong performance on downstream tasks. It demonstrates the effectiveness of BERT-like pre-training for image Transformers and shows intriguing properties of learned semantic regions. Future work includes scaling up BEiT pre-training and exploring multimodal pre-training.