BEiT: BERT Pre-Training of Image Transformers


3 Sep 2022 | Hangbo Bao, Li Dong, Songhao Piao, Furu Wei
**Authors:** Hangbo Bao, Li Dong, Songhao Piao, Furu Wei
**Institution:** Harbin Institute of Technology, Microsoft Research

**Abstract:** BEiT (Bidirectional Encoder representation from Image Transformers) is a self-supervised vision representation model inspired by BERT. It introduces a masked image modeling (MIM) task to pre-train vision Transformers. Each image is tokenized into visual tokens, and some image patches are randomly masked. The pre-training objective is to recover the original visual tokens from the corrupted image patches. After pre-training, BEiT is fine-tuned on downstream tasks such as image classification and semantic segmentation. Experimental results show that BEiT achieves competitive performance with previous pre-training methods and outperforms both from-scratch training and strong self-supervised models. Ablation studies demonstrate the effectiveness of the proposed techniques, and BEiT learns to distinguish semantic regions and object boundaries without any human annotation.

**Key Contributions:**
- Introduction of a masked image modeling (MIM) task for pre-training vision Transformers.
- Extensive fine-tuning experiments on downstream tasks.
- Demonstration of BEiT's ability to learn semantic regions and object boundaries.
- Ablation studies showing the importance of the proposed techniques.

**Methods:**
- **Image Representations:** Images are split into patches and, in parallel, tokenized into discrete visual tokens.
- **Backbone Network:** A standard Transformer (ViT) is used as the backbone network.
- **Pre-Training BEiT:** The MIM task predicts the visual tokens of masked patches from the corrupted image (a minimal sketch follows this list).
- **From the Perspective of Variational Autoencoder:** BEiT pre-training can be viewed as variational autoencoder training.
- **Pre-Training Setup:** BEiT is pre-trained on ImageNet-1K from random initialization, without labels.
- **Fine-Tuning BEiT:** Task layers are appended to the pre-trained Transformer for fine-tuning on downstream tasks (see the classification sketch at the end of this summary).
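The summary above only names the masked image modeling objective, so the sketch below illustrates its overall shape under stated assumptions. `vit_encoder`, `image_tokenizer`, and `mim_head` are hypothetical stand-ins (the paper uses a ViT backbone, DALL-E's discrete VAE tokenizer with an 8192-token vocabulary, and a softmax prediction head), and the blockwise masking here is a simplified version of the paper's procedure, which covers roughly 40% of the 14×14 patch grid.

```python
import torch
import torch.nn as nn

# Hypothetical components standing in for the paper's actual modules:
#   - image_tokenizer: a frozen discrete VAE (the paper uses DALL-E's tokenizer)
#     mapping an image to a 14x14 grid of visual-token ids from a vocabulary of 8192.
#   - vit_encoder: a standard ViT over 16x16 patches that substitutes a learnable
#     [MASK] embedding at the masked positions before running the Transformer.

VOCAB_SIZE = 8192      # visual-token vocabulary size (DALL-E dVAE)
NUM_PATCHES = 14 * 14  # 224x224 image split into 16x16 patches
DIM = 768              # ViT-Base hidden size


def blockwise_mask(num_patches: int = NUM_PATCHES, mask_ratio: float = 0.4,
                   grid: int = 14) -> torch.Tensor:
    """Simplified blockwise masking: repeatedly mask rectangular blocks of patches
    until roughly `mask_ratio` of the grid is covered."""
    mask = torch.zeros(grid, grid, dtype=torch.bool)
    target = int(num_patches * mask_ratio)
    while mask.sum() < target:
        h = torch.randint(2, grid // 2 + 1, (1,)).item()
        w = torch.randint(2, grid // 2 + 1, (1,)).item()
        top = torch.randint(0, grid - h + 1, (1,)).item()
        left = torch.randint(0, grid - w + 1, (1,)).item()
        mask[top:top + h, left:left + w] = True
    return mask.flatten()  # (num_patches,) boolean mask


def mim_loss(images: torch.Tensor,
             vit_encoder: nn.Module,
             image_tokenizer,          # frozen; returns (B, NUM_PATCHES) token ids
             mim_head: nn.Linear) -> torch.Tensor:
    """Masked image modeling: predict the visual tokens of the masked patches only."""
    with torch.no_grad():
        targets = image_tokenizer(images)         # (B, NUM_PATCHES) int64 token ids
    mask = blockwise_mask().to(images.device)     # (NUM_PATCHES,) bool
    hidden = vit_encoder(images, mask)            # (B, NUM_PATCHES, DIM)
    logits = mim_head(hidden[:, mask])            # (B, n_masked, VOCAB_SIZE)
    return nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB_SIZE), targets[:, mask].reshape(-1))
```

Note that the tokenizer is kept frozen during pre-training and the loss is computed only over masked positions; the unmasked patches serve as context for the prediction.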
**Experiments:**
- **Image Classification:** BEiT outperforms random initialization, supervised pre-training, and previous self-supervised methods.
- **Semantic Segmentation:** BEiT achieves better performance than supervised pre-training on ADE20K.
- **Ablation Studies:** Blockwise masking and visual tokens are crucial for BEiT's effectiveness.
- **Analysis of Self-Attention Map:** BEiT can separate objects using its self-attention mechanism without manual annotations.

**Related Work:**
- Self-supervised visual representation learning methods, including contrastive learning and clustering.
- Self-supervised vision Transformers, such as iGPT and ViT.

**Conclusion:** BEiT introduces a self-supervised pre-training framework for vision Transformers, achieving strong performance on downstream tasks. It demonstrates the effectiveness of BERT-like pre-training for image Transformers and shows intriguing properties of the learned semantic regions. Future work includes scaling up BEiT pre-training and exploring multimodal pre-training.
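For downstream image classification, the paper describes appending a simple task layer: the final patch representations are average-pooled and fed to a softmax classifier, with the whole network updated end to end. Below is a minimal sketch under the same assumptions as above; the hypothetical `vit_encoder` is called without a mask, since no patches are corrupted at fine-tuning time.

```python
import torch
import torch.nn as nn


class BEiTClassifier(nn.Module):
    """Fine-tuning sketch: a linear task layer on top of average-pooled patch
    representations from the pre-trained encoder (hypothetical `vit_encoder`)."""

    def __init__(self, vit_encoder: nn.Module, dim: int = 768, num_classes: int = 1000):
        super().__init__()
        self.encoder = vit_encoder               # initialized from BEiT pre-training
        self.head = nn.Linear(dim, num_classes)  # randomly initialized task layer

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        hidden = self.encoder(images)    # (B, num_patches, dim); no [MASK] tokens here
        pooled = hidden.mean(dim=1)      # average-pool the patch representations
        return self.head(pooled)         # class logits for a softmax classifier


# Usage sketch: both the pre-trained encoder and the new head are updated end to end.
# model = BEiTClassifier(pretrained_vit_encoder)
# loss = nn.functional.cross_entropy(model(images), labels)
```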