19 Dec 2021 | Kaiming He*,†, Xinlei Chen*, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick
Masked Autoencoders (MAE) are effective and scalable self-supervised learners for computer vision. The approach masks random patches of an input image and reconstructs the missing pixels. The key design is an asymmetric encoder-decoder architecture: the encoder operates only on the visible patches (without mask tokens), while a lightweight decoder reconstructs the image from the latent representation together with mask tokens. Masking a high proportion of the image (e.g., 75%) creates a nontrivial self-supervisory task, and because the encoder never processes the masked patches, pre-training is both faster and more accurate. MAE makes it practical to learn high-capacity models that generalize well, reaching 87.8% accuracy on ImageNet-1K with a vanilla ViT-Huge model. Transfer performance on downstream tasks surpasses supervised pre-training and shows promising scaling behavior.
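A minimal PyTorch sketch may help make the asymmetry concrete. The shuffle-and-gather masking follows the pattern used in the released MAE code, but everything else here is a simplification for illustration: the names (TinyMAE, random_masking), the dimensions and depths, and the omission of positional embeddings and per-patch normalization are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


def random_masking(x, mask_ratio=0.75):
    """Keep a random subset of patch tokens. Returns the kept tokens,
    a binary mask (1 = masked/removed), and indices to restore order."""
    n, l, d = x.shape
    len_keep = int(l * (1 - mask_ratio))

    noise = torch.rand(n, l, device=x.device)        # one score per patch
    ids_shuffle = torch.argsort(noise, dim=1)        # random permutation
    ids_restore = torch.argsort(ids_shuffle, dim=1)  # its inverse

    ids_keep = ids_shuffle[:, :len_keep]
    x_kept = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))

    mask = torch.ones(n, l, device=x.device)
    mask[:, :len_keep] = 0                           # 0 = kept, 1 = masked
    mask = torch.gather(mask, 1, ids_restore)        # back to patch order
    return x_kept, mask, ids_restore


class TinyMAE(nn.Module):
    """Asymmetric design: the encoder sees only the visible patches;
    a lightweight decoder gets encoded tokens plus shared mask tokens."""

    def __init__(self, enc_dim=256, dec_dim=128, patch_pixels=768):
        super().__init__()
        self.embed = nn.Linear(patch_pixels, enc_dim)   # patch -> token
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(enc_dim, nhead=8, batch_first=True),
            num_layers=4)
        self.enc_to_dec = nn.Linear(enc_dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dec_dim, nhead=4, batch_first=True),
            num_layers=2)
        self.head = nn.Linear(dec_dim, patch_pixels)    # predict raw pixels

    def forward(self, patches, mask_ratio=0.75):
        tokens = self.embed(patches)                    # (n, l, enc_dim)
        x_kept, mask, ids_restore = random_masking(tokens, mask_ratio)
        latent = self.encoder(x_kept)                   # visible tokens only

        y = self.enc_to_dec(latent)
        n, l = ids_restore.shape
        mask_tokens = self.mask_token.expand(n, l - y.shape[1], -1)
        y = torch.cat([y, mask_tokens], dim=1)
        y = torch.gather(                               # restore patch order
            y, 1, ids_restore.unsqueeze(-1).expand(-1, -1, y.shape[-1]))
        return self.head(self.decoder(y)), mask
```

Note where the savings come from: at a 75% masking ratio the full-width encoder runs on only 25% of the tokens, and the mask tokens are introduced only in the narrow, shallow decoder.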
The MAE approach is simple, efficient, and scalable. With 75% of patches masked, the encoder processes only a quarter of each image, and the lightweight decoder adds little overhead, keeping pre-training computation low. The learned representations transfer well to object detection, instance segmentation, and semantic segmentation; they outperform previous methods in transfer learning and perform strongly under both linear probing and fine-tuning. MAE also works well with minimal data augmentation and is more efficient than other self-supervised methods such as BEiT.
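To make the training signal concrete, here is a sketch of the objective under the same assumptions as the TinyMAE example above: the paper's loss is mean-squared error in pixel space, averaged only over the masked patches (the paper also studies a per-patch-normalized-pixel variant, omitted here). The mae_loss helper and the shapes below are illustrative.

```python
# Reconstruction loss computed only on masked patches, reusing the
# hypothetical TinyMAE sketch above.
def mae_loss(model, patches, mask_ratio=0.75):
    pred, mask = model(patches, mask_ratio)            # mask: 1 = masked
    per_patch = ((pred - patches) ** 2).mean(dim=-1)   # (n, l) MSE per patch
    return (per_patch * mask).sum() / mask.sum()       # masked patches only


model = TinyMAE()
patches = torch.randn(8, 196, 768)   # a toy batch of patchified images
loss = mae_loss(model, patches)
loss.backward()                      # drives self-supervised pre-training
```

After pre-training, the decoder is discarded and only the encoder is kept, whether for linear probing or full fine-tuning on a downstream task.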
The study shows that MAE pre-trains large models efficiently using only ImageNet-1K data and outperforms supervised pre-training, with the gains growing as model capacity increases. These results indicate that MAE is a powerful tool for self-supervised learning in computer vision, offering scalable benefits and strong performance across tasks; its simplicity and efficiency make it a promising direction for future research and applications in vision learning.