25 Jan 2024 | Letian Fu, Long Lian, Renhao Wang, Baifeng Shi, Xudong Wang, Adam Yala, Trevor Darrell, Alexei A. Efros, Ken Goldberg
This paper re-examines the inter-patch dependencies in the decoding mechanism of masked autoencoders (MAE) and proposes a framework called Cross-Attention Masked Autoencoders (CrossMAE). The authors decompose MAE decoding into self-attention and cross-attention, suggesting that self-attention between masked patches is not essential for learning good representations. CrossMAE therefore uses only cross-attention between masked and visible tokens, achieving similar or better performance than MAE while reducing decoding compute by a factor of 2.5 to 3.7. Because masked tokens do not attend to one another, the design also enables partial reconstruction: only a subset of the masked tokens needs to be decoded, which further improves efficiency. Additionally, each decoder block can use different encoder features, enhancing representation learning. CrossMAE matches MAE's performance on ImageNet classification and COCO instance segmentation, demonstrating its effectiveness and efficiency. The paper includes experimental results, ablation studies, and visualizations to support these findings.
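To make the efficiency argument concrete, here is a rough sketch that counts query-key pairs in one decoder attention layer under the two designs. It is an illustration under stated assumptions, not the paper's cost measurement: the helper names, the 196-patch input (a 224x224 image with 16x16 patches), the 75% mask ratio, and the `predict_ratio` parameter are all hypothetical choices made here. The count also ignores projections and MLP layers, so the ratios it produces are not expected to match the 2.5 to 3.7 times figure reported above.

```python
def attention_pairs(num_queries: int, num_keys: int) -> int:
    """Number of query-key score entries one attention layer computes."""
    return num_queries * num_keys


def mae_decoder_pairs(num_patches: int) -> int:
    """MAE-style decoder: self-attention over all tokens (visible + masked)."""
    return attention_pairs(num_patches, num_patches)


def cross_decoder_pairs(num_patches: int, mask_ratio: float, predict_ratio: float) -> int:
    """Cross-attention-only decoder: masked tokens query visible tokens.

    predict_ratio is the fraction of ALL patches that are decoded
    (hypothetical parameter name); it cannot exceed mask_ratio, since only
    masked patches are reconstructed.
    """
    if not 0.0 < predict_ratio <= mask_ratio < 1.0:
        raise ValueError("need 0 < predict_ratio <= mask_ratio < 1")
    num_visible = round(num_patches * (1.0 - mask_ratio))  # keys and values
    num_queries = round(num_patches * predict_ratio)       # decoded masked tokens
    return attention_pairs(num_queries, num_visible)


if __name__ == "__main__":
    patches = 196  # assumed: 224x224 image, 16x16 patches
    print(mae_decoder_pairs(patches))                # 38416
    print(cross_decoder_pairs(patches, 0.75, 0.75))  # 7203 (decode every masked token)
    print(cross_decoder_pairs(patches, 0.75, 0.25))  # 2401 (partial reconstruction)
```

In this toy count, dropping self-attention among masked tokens shrinks the attention map, and decoding only a subset of masked tokens shrinks it further, which is the intuition behind partial reconstruction.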