MetaFormer Is Actually What You Need for Vision

4 Jul 2022 | Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, Shuicheng Yan
MetaFormer is a general architecture abstracted from Transformers in which the token mixer is left unspecified. The paper argues that this abstracted architecture, rather than any particular token mixer, is what matters for competitive performance on vision tasks: even a simple, non-parametric operator such as pooling suffices as the token mixer. The resulting model, PoolFormer, outperforms well-tuned Vision Transformer and MLP-like baselines on ImageNet-1K, reaching 82.1% top-1 accuracy with fewer parameters and MACs. PoolFormer also performs well on object detection, instance segmentation, and semantic segmentation, demonstrating its versatility. Taken together, the experiments support the claim that the MetaFormer architecture, not the specific token mixer, is the key to superior performance, and the paper calls for future research to focus on improving MetaFormer rather than designing ever more elaborate token mixers.
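To make the MetaFormer abstraction concrete, below is a minimal PyTorch sketch of a PoolFormer-style block, assuming channels-first feature maps of shape (B, C, H, W). The class names (Pooling, PoolFormerBlock), the use of GroupNorm with a single group as the normalization layer, and the 1x1-convolution channel MLP are illustrative choices for this sketch rather than a reproduction of the authors' released implementation. What it demonstrates is the MetaFormer pattern, norm -> token mixer -> residual followed by norm -> channel MLP -> residual, with plain average pooling (minus the identity) serving as the non-parametric token mixer.

# Minimal PoolFormer-style block sketch (illustrative, not the authors' code).
import torch
import torch.nn as nn


class Pooling(nn.Module):
    """Non-parametric token mixer: average pooling minus the identity."""

    def __init__(self, pool_size: int = 3):
        super().__init__()
        self.pool = nn.AvgPool2d(
            pool_size, stride=1, padding=pool_size // 2, count_include_pad=False
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Subtract x so the mixer outputs only the "mixing" signal; the
        # identity path is already provided by the block's residual connection.
        return self.pool(x) - x


class PoolFormerBlock(nn.Module):
    """MetaFormer block with pooling as the token mixer (sketch only)."""

    def __init__(self, dim: int, mlp_ratio: int = 4, pool_size: int = 3):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, dim)   # GroupNorm with one group ~ channel-wise LayerNorm
        self.token_mixer = Pooling(pool_size)
        self.norm2 = nn.GroupNorm(1, dim)
        hidden = dim * mlp_ratio
        self.mlp = nn.Sequential(           # channel MLP implemented with 1x1 convolutions
            nn.Conv2d(dim, hidden, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(hidden, dim, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.token_mixer(self.norm1(x))   # spatial (token) mixing sub-block
        x = x + self.mlp(self.norm2(x))           # channel mixing sub-block
        return x


if __name__ == "__main__":
    block = PoolFormerBlock(dim=64)
    out = block(torch.randn(2, 64, 56, 56))
    print(out.shape)  # torch.Size([2, 64, 56, 56])

One design note: having Pooling return pool(x) - x avoids double-counting the identity, since the skip connection around the token mixer already carries the unmixed features.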