MetaFormer Is Actually What You Need for Vision

4 Jul 2022 | Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, Shuicheng Yan
MetaFormer is a general architecture derived from Transformers without specifying the token mixer. The paper argues that the success of Transformer/MLP-like models primarily stems from the general architecture MetaFormer rather than specific token mixers. To demonstrate this, the authors use an embarrassingly simple non-parametric operator, pooling, for token mixing. The resulting model, PoolFormer, consistently outperforms well-tuned vision Transformer and MLP-like baselines, achieving 82.1% top-1 accuracy on ImageNet-1K with fewer parameters and MACs. This supports the claim that MetaFormer is actually what is needed for competitive performance in vision tasks.
The paper also evaluates PoolFormer on multiple vision tasks, including object detection, instance segmentation, and semantic segmentation, showing competitive performance compared to state-of-the-art models. The results indicate that MetaFormer is the key to achieving superior results for recent Transformer and MLP-like models on vision tasks. The paper calls for more future research dedicated to improving MetaFormer rather than focusing on specific token mixer modules.
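To make the "pooling as token mixer" idea concrete, the sketch below implements a pooling token mixer in NumPy: average pooling with stride 1 and same-size zero padding, with the input subtracted afterwards so that, inside a residual MetaFormer block, the branch contributes only the difference between each token and its neighborhood. This follows the paper's description of PoolFormer's mixer; the zero-padding behavior and the channel-first `(C, H, W)` layout here are simplifying assumptions of this sketch, not a faithful reproduction of the official implementation.

```python
import numpy as np

def pool_token_mixer(x, pool_size=3):
    """Pooling token mixer, per the PoolFormer idea.

    Average-pools each channel with stride 1 and 'same' zero padding,
    then subtracts the input, so a residual connection around this
    mixer adds only the neighborhood difference.

    x: array of shape (C, H, W). Returns an array of the same shape.
    """
    pad = pool_size // 2
    c, h, w = x.shape
    # Zero-pad spatial dims so output size matches input size (sketch assumption).
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.empty_like(x, dtype=float)
    for i in range(h):
        for j in range(w):
            # Mean over the pool_size x pool_size neighborhood of token (i, j).
            out[:, i, j] = xp[:, i:i + pool_size, j:j + pool_size].mean(axis=(1, 2))
    # Subtract the identity so only the neighborhood difference remains.
    return out - x
```

On a constant feature map, interior tokens already equal their neighborhood mean, so the mixer outputs zero there; only border tokens change, because the zero padding pulls their mean down. In the full PoolFormer block, this mixer replaces attention between the usual normalization, residual, and channel-MLP components of the MetaFormer architecture.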