25 Jan 2024 | Letian Fu*, Long Lian*, Renhao Wang, Baifeng Shi, Xudong Wang, Adam Yala, Trevor Darrell†, Alexei A. Efros†, Ken Goldberg†
This paper re-examines the decoding mechanism in masked autoencoders (MAE) and proposes Cross-Attention Masked Autoencoders (CrossMAE). The study asks whether self-attention among mask tokens is essential for learning good representations and finds that it is not: CrossMAE replaces the decoder's self-attention with cross-attention from masked to visible tokens and matches MAE's performance at significantly lower computational cost. Because mask tokens no longer attend to one another, the decoder can reconstruct only a subset of the masked patches (partial reconstruction), which further improves pretraining efficiency. In addition, an inter-block attention mechanism lets each decoder block draw on features from different encoder layers rather than only the final one, strengthening the learned representations.

CrossMAE matches MAE with 2.5 to 3.7× less decoding compute and surpasses it on ImageNet classification and COCO instance segmentation under the same compute budget. Concretely, CrossMAE reaches 83.5% classification accuracy on ImageNet, exceeding its full-reconstruction MAE counterpart, and achieves 52.1 box AP and 46.3 mask AP for object detection and instance segmentation on COCO, again surpassing MAE. The gains also hold at larger scale with ViT-L. Overall, the results indicate that self-attention among mask tokens is not necessary for learning good representations, that cross-attention is an effective alternative for masked autoencoders, and that CrossMAE offers a more efficient and effective approach, with room for further improvements in how self-attention and cross-attention are used.
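To make the mechanism concrete, here is a minimal PyTorch sketch of the core idea: a decoder block in which mask-token queries attend only to visible encoder tokens, plus partial reconstruction that decodes a random subset of the masked patches. The names (`CrossAttentionDecoderBlock`, `decode_partial`, `prediction_ratio`) are illustrative assumptions, not the authors' exact implementation.

```python
# Hedged sketch of CrossMAE-style decoding, assuming a ViT encoder has already
# produced `visible_tokens`. Mask tokens attend to visible tokens only; there is
# no self-attention among mask tokens.
import torch
import torch.nn as nn


class CrossAttentionDecoderBlock(nn.Module):
    """Mask queries cross-attend to visible (encoder) tokens. Without
    self-attention among mask tokens, decoding a subset of them is cheap."""

    def __init__(self, dim: int, num_heads: int = 8, mlp_ratio: float = 4.0):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_mlp = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, mask_queries, visible_tokens):
        q = self.norm_q(mask_queries)
        kv = self.norm_kv(visible_tokens)
        attn_out, _ = self.cross_attn(q, kv, kv)  # queries: masked, keys/values: visible
        x = mask_queries + attn_out
        x = x + self.mlp(self.norm_mlp(x))
        return x


def decode_partial(mask_tokens, visible_tokens, blocks, prediction_ratio=0.25):
    """Partial reconstruction: decode only a random subset of mask tokens.
    `mask_tokens` would be a learned mask embedding plus positional embeddings
    for the masked positions; `prediction_ratio` is a hypothetical knob for the
    fraction of masked patches that receive a reconstruction loss."""
    B, M, D = mask_tokens.shape
    num_pred = max(1, int(M * prediction_ratio))
    idx = torch.rand(B, M).argsort(dim=1)[:, :num_pred]  # random subset per sample
    queries = torch.gather(mask_tokens, 1, idx.unsqueeze(-1).expand(-1, -1, D))
    for blk in blocks:
        queries = blk(queries, visible_tokens)
    return queries, idx  # idx selects which patch targets enter the loss
```

Because the masked queries never attend to each other, the attention cost scales with the number of decoded queries times the number of visible tokens, so reconstructing only a fraction of the masked patches cuts decoder compute roughly proportionally.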
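The inter-block attention idea can be sketched in the same spirit: rather than feeding every decoder block the final encoder output, each decoder block reads from its own learned weighting of intermediate encoder features. The module below (`InterBlockFeatureFusion`) and its weighting scheme are an assumed illustration of that idea, not the paper's exact design.

```python
# Hedged sketch: per-decoder-block fusion of intermediate encoder features.
# `encoder_features` is a list of per-layer token maps, each of shape (B, N, D).
import torch
import torch.nn as nn


class InterBlockFeatureFusion(nn.Module):
    def __init__(self, num_encoder_layers: int, num_decoder_blocks: int):
        super().__init__()
        # One weight vector over encoder layers per decoder block;
        # zeros -> uniform weights after softmax at initialization.
        self.weights = nn.Parameter(torch.zeros(num_decoder_blocks, num_encoder_layers))

    def forward(self, encoder_features: list, block_idx: int) -> torch.Tensor:
        stacked = torch.stack(encoder_features, dim=0)   # (L, B, N, D)
        w = self.weights[block_idx].softmax(dim=0)       # normalize over encoder layers
        return torch.einsum("l,lbnd->bnd", w, stacked)   # fused visible tokens for this block
```

The fused output would then serve as the keys and values for the corresponding cross-attention decoder block, letting earlier and later decoder blocks emphasize different levels of the encoder's feature hierarchy.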