SimMIM: a Simple Framework for Masked Image Modeling

Zhenda Xie¹*, Zheng Zhang²*, Yue Cao²*, Yutong Lin³, Jianmin Bao², Zhuliang Yao¹, Qi Dai², Han Hu²*
This paper introduces SimMIM, a simple and effective framework for masked image modeling. It replaces the complex designs of prior work, such as block-wise masking and tokenization via discrete VAE or clustering, with three simple components: a random masking strategy over moderately large image patches, prediction of raw pixel values through a lightweight linear layer, and an $\ell_1$ loss. The key insights are:

1. **Random masking**: the averaged distance from masked pixels to their nearest visible pixels (AvgDist) is a strong indicator of representation quality, and a moderate AvgDist (roughly 10-20) is optimal; random masking with a moderately large masked patch size stays in this range across a wide span of masking ratios (a minimal sketch of the masking strategy and the AvgDist metric follows this summary).
2. **Prediction head**: a lightweight linear layer performs as well as much heavier heads while reducing pre-training cost.
3. **Prediction target**: regressing raw pixel values directly is as effective as classification-based targets, and it aligns with the continuous nature of visual signals (a sketch of the linear head and the masked $\ell_1$ loss also follows).

With these components, SimMIM achieves 83.8% top-1 fine-tuning accuracy on ImageNet-1K with ViT-B, surpassing the previous best approach by +0.6%. It also scales well to larger models: SwinV2-H reaches 87.1% top-1 accuracy on ImageNet-1K, and SimMIM is further used to train a 3B-parameter SwinV2-G model with roughly 40× less labeled data than prior billion-scale practice, achieving strong results on multiple vision benchmarks. The paper supports these findings with ablation studies, experiments across architectures, and visualizations.
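The masking strategy above can be illustrated with a short sketch. This is a minimal reading of the approach, assuming the defaults reported in the paper (192×192 pre-training input, masked patch size 32, masking ratio 0.6); the function name is ours, not from the paper's code.

```python
import numpy as np

def random_patch_mask(img_size=192, mask_patch_size=32, mask_ratio=0.6, rng=None):
    """Sample a patch-level random mask: 1 = masked, 0 = visible."""
    rng = np.random.default_rng() if rng is None else rng
    grid = img_size // mask_patch_size          # 192 // 32 = 6 patches per side
    num_patches = grid * grid
    num_masked = int(mask_ratio * num_patches)
    mask = np.zeros(num_patches, dtype=np.int64)
    # Choose which patches to mask uniformly at random, without replacement.
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = 1
    return mask.reshape(grid, grid)
```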
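The AvgDist metric used to compare masking strategies can likewise be sketched. The paper does not prescribe an implementation; this version assumes SciPy's Euclidean distance transform, which for every nonzero (masked) pixel returns the distance to the nearest zero (visible) pixel.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def avg_dist(patch_mask, mask_patch_size=32):
    """AvgDist: mean Euclidean distance from each masked pixel
    to its nearest visible pixel."""
    # Upsample the patch-level mask to pixel resolution.
    block = np.ones((mask_patch_size, mask_patch_size))
    pixel_mask = np.kron(patch_mask, block)
    dist = distance_transform_edt(pixel_mask)   # zero on visible pixels
    return dist[pixel_mask == 1].mean()
```

For the defaults above, `avg_dist(random_patch_mask())` should land roughly in the 10-20 band that the paper identifies as favorable.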
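Finally, a sketch of the one-layer prediction head and the masked $\ell_1$ loss, $L = \frac{1}{\Omega(x_M)} \lVert y_M - x_M \rVert_1$, where $x_M$ and $y_M$ are the masked input pixels and their predictions and $\Omega(\cdot)$ counts elements. The feature dimension and the stride-32 output layout are illustrative assumptions for a Swin-like encoder; the paper specifies only that a single linear layer maps features back to raw pixels and that the loss is computed on masked pixels.

```python
import torch
import torch.nn as nn

class LinearPredictionHead(nn.Module):
    """One linear layer mapping encoder features to raw RGB pixels.
    Assumed dims: C-dim features on a stride-32 grid, so each feature
    vector predicts one 32x32x3 pixel block."""
    def __init__(self, in_dim=1024, stride=32):
        super().__init__()
        self.stride = stride
        self.proj = nn.Linear(in_dim, stride * stride * 3)

    def forward(self, feats):                     # feats: (B, H, W, C)
        B, H, W, _ = feats.shape
        s = self.stride
        pix = self.proj(feats).view(B, H, W, s, s, 3)
        pix = pix.permute(0, 5, 1, 3, 2, 4)       # (B, 3, H, s, W, s)
        return pix.reshape(B, 3, H * s, W * s)    # full-resolution prediction

def masked_l1_loss(pred, target, pixel_mask):
    """l1 loss averaged over masked pixels only (visible pixels ignored)."""
    m = pixel_mask.unsqueeze(1).float()           # (B, 1, H_img, W_img), 1 = masked
    return (torch.abs(pred - target) * m).sum() / (3 * m.sum() + 1e-8)
```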