19 Dec 2021 | Kaiming He*,†, Xinlei Chen*, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick
Masked Autoencoders (MAE) are effective and scalable self-supervised learners for computer vision. The approach masks random patches of an input image and reconstructs the missing pixels. The key design is an asymmetric encoder-decoder architecture: the encoder operates only on the visible patches (without mask tokens), while a lightweight decoder reconstructs the image from the latent representation together with mask tokens. Masking a high proportion of the image (e.g., 75%) creates a nontrivial self-supervisory task, and because the encoder never processes the masked patches, pre-training is both faster and more accurate. MAE makes it practical to learn high-capacity models that generalize well, reaching 87.8% accuracy on ImageNet-1K with a vanilla ViT-Huge model. Transfer performance on downstream tasks surpasses supervised pre-training and shows promising scaling behavior.
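A minimal PyTorch sketch may help make the asymmetry concrete. The shuffle-and-gather masking follows the pattern used in the released MAE code, but everything else here is a simplification for illustration: the names (TinyMAE, random_masking), the dimensions and depths, and the omission of positional embeddings and per-patch normalization are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


def random_masking(x, mask_ratio=0.75):
    """Keep a random subset of patch tokens. Returns the kept tokens,
    a binary mask (1 = masked/removed), and indices to restore order."""
    n, l, d = x.shape
    len_keep = int(l * (1 - mask_ratio))

    noise = torch.rand(n, l, device=x.device)        # one score per patch
    ids_shuffle = torch.argsort(noise, dim=1)        # random permutation
    ids_restore = torch.argsort(ids_shuffle, dim=1)  # its inverse

    ids_keep = ids_shuffle[:, :len_keep]
    x_kept = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))

    mask = torch.ones(n, l, device=x.device)
    mask[:, :len_keep] = 0                           # 0 = kept, 1 = masked
    mask = torch.gather(mask, 1, ids_restore)        # back to patch order
    return x_kept, mask, ids_restore


class TinyMAE(nn.Module):
    """Asymmetric design: the encoder sees only the visible patches;
    a lightweight decoder gets encoded tokens plus shared mask tokens."""

    def __init__(self, enc_dim=256, dec_dim=128, patch_pixels=768):
        super().__init__()
        self.embed = nn.Linear(patch_pixels, enc_dim)   # patch -> token
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(enc_dim, nhead=8, batch_first=True),
            num_layers=4)
        self.enc_to_dec = nn.Linear(enc_dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dec_dim, nhead=4, batch_first=True),
            num_layers=2)
        self.head = nn.Linear(dec_dim, patch_pixels)    # predict raw pixels

    def forward(self, patches, mask_ratio=0.75):
        tokens = self.embed(patches)                    # (n, l, enc_dim)
        x_kept, mask, ids_restore = random_masking(tokens, mask_ratio)
        latent = self.encoder(x_kept)                   # visible tokens only

        y = self.enc_to_dec(latent)
        n, l = ids_restore.shape
        mask_tokens = self.mask_token.expand(n, l - y.shape[1], -1)
        y = torch.cat([y, mask_tokens], dim=1)
        y = torch.gather(                               # restore patch order
            y, 1, ids_restore.unsqueeze(-1).expand(-1, -1, y.shape[-1]))
        return self.head(self.decoder(y)), mask
```

Note where the savings come from: at a 75% masking ratio the full-width encoder runs on only 25% of the tokens, and the mask tokens are introduced only in the narrow, shallow decoder.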
The MAE approach is simple, efficient, and scalable. With 75% of patches masked, the encoder processes only a quarter of each image, and the lightweight decoder adds little overhead, keeping pre-training computation low. The learned representations transfer well to object detection, instance segmentation, and semantic segmentation; they outperform previous methods in transfer learning and perform strongly under both linear probing and fine-tuning. MAE also works well with minimal data augmentation and is more efficient than other self-supervised methods such as BEiT.
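To make the training signal concrete, here is a sketch of the objective under the same assumptions as the TinyMAE example above: the paper's loss is mean-squared error in pixel space, averaged only over the masked patches (the paper also studies a per-patch-normalized-pixel variant, omitted here). The mae_loss helper and the shapes below are illustrative.

```python
# Reconstruction loss computed only on masked patches, reusing the
# hypothetical TinyMAE sketch above.
def mae_loss(model, patches, mask_ratio=0.75):
    pred, mask = model(patches, mask_ratio)            # mask: 1 = masked
    per_patch = ((pred - patches) ** 2).mean(dim=-1)   # (n, l) MSE per patch
    return (per_patch * mask).sum() / mask.sum()       # masked patches only


model = TinyMAE()
patches = torch.randn(8, 196, 768)   # a toy batch of patchified images
loss = mae_loss(model, patches)
loss.backward()                      # drives self-supervised pre-training
```

After pre-training, the decoder is discarded and only the encoder is kept, whether for linear probing or full fine-tuning on a downstream task.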
The study shows that MAE pre-trains large models efficiently using only ImageNet-1K data and outperforms supervised pre-training, with the gains growing as model capacity increases. These results indicate that MAE is a powerful tool for self-supervised learning in computer vision, offering scalable benefits and strong performance across tasks; its simplicity and efficiency make it a promising direction for future research and applications in vision learning.