SimMIM is a simple framework for masked image modeling that simplifies recent approaches by avoiding complex designs such as block-wise masking and tokenization via a discrete VAE or clustering. The paper systematically studies the framework's components and finds that simple designs already achieve strong representation learning performance. Random masking with a moderately large masked patch size (e.g., 32) creates a powerful pretext task. Predicting raw pixel values via regression performs no worse than more complex classification-based approaches. A lightweight prediction head, such as a single linear layer, achieves performance comparable to heavier heads. Using ViT-B, SimMIM achieves 83.8% top-1 accuracy on ImageNet-1K, surpassing the previous best approach by +0.6%. Applied to a larger model (SwinV2-H), it achieves 87.1% top-1 accuracy on ImageNet-1K using ImageNet-1K data alone. SimMIM also addresses the data-hungry problem in large-scale model training, successfully training a 3B-parameter model (SwinV2-G) with 40× less labeled data than JFT-3B and achieving state-of-the-art results on multiple benchmarks. The framework is simple, effective, and scalable, making it a promising approach for representation learning.
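To make the three design choices above concrete (random masking of large patches, a linear prediction head, and raw-pixel regression), here is a minimal sketch assuming PyTorch. The class and function names, the 0.6 masking ratio, and the use of an L1 loss are illustrative assumptions not spelled out in the summary above, not the paper's reference implementation.

```python
import torch
import torch.nn as nn


class SimMIMHead(nn.Module):
    """Hypothetical lightweight prediction head: a single linear layer
    mapping encoder features back to the raw pixels of each patch."""

    def __init__(self, encoder_dim: int = 768, patch_size: int = 32, in_chans: int = 3):
        super().__init__()
        # one linear layer predicts every pixel value of a masked patch
        self.proj = nn.Linear(encoder_dim, patch_size * patch_size * in_chans)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (B, num_patches, encoder_dim) -> (B, num_patches, patch_size^2 * in_chans)
        return self.proj(features)


def random_patch_mask(batch_size: int, num_patches: int,
                      mask_ratio: float = 0.6, device: str = "cpu") -> torch.Tensor:
    """Randomly mask a fixed fraction of (moderately large, e.g. 32x32) patches.
    Returns a (B, num_patches) tensor with 1 = masked, 0 = visible."""
    num_masked = int(num_patches * mask_ratio)
    noise = torch.rand(batch_size, num_patches, device=device)
    ids = noise.argsort(dim=1)                     # random permutation per sample
    mask = torch.zeros(batch_size, num_patches, device=device)
    mask.scatter_(1, ids[:, :num_masked], 1.0)
    return mask


def masked_pixel_regression_loss(pred_pixels: torch.Tensor,
                                 target_pixels: torch.Tensor,
                                 mask: torch.Tensor) -> torch.Tensor:
    """Regression on raw pixel values (here an assumed L1 loss),
    averaged only over the masked patches."""
    per_patch = (pred_pixels - target_pixels).abs().mean(dim=-1)  # (B, num_patches)
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)
```

In this sketch the encoder (e.g., a ViT-B or Swin backbone) would consume the image with masked patches replaced by a learnable mask token, and only the head and loss shown here are specific to the masked image modeling objective.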