The paper explores "Scaling on Scales" (S²), a method that scales vision models by running them on multiple image scales rather than increasing model size. The authors demonstrate that smaller, pre-trained vision models (e.g., ViT-B or ViT-L) can outperform larger models (e.g., ViT-H or ViT-G) on tasks such as classification, segmentation, depth estimation, and robotic manipulation. Applied to the visual encoder of multimodal LLMs (MLLMs), S² achieves state-of-the-art detailed-understanding performance on the V* benchmark, surpassing models such as GPT-4V. The paper also examines when S² is more effective than scaling model size: although larger models generally generalize better on hard examples, their features can be well approximated by multi-scale smaller models, and pre-training smaller models with S² can match or even exceed the performance of larger models. The authors release a Python package that applies S² to any vision model with a single line of code.
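As a rough illustration of the idea (not the authors' released package), the core S² recipe can be sketched as follows: interpolate the image to a larger scale, split it into base-resolution crops, run the same frozen backbone on each crop, pool the resulting features back to the base feature resolution, and concatenate the per-scale features channel-wise. The `s2_features` helper below is a hypothetical, minimal sketch in PyTorch; it assumes a ViT-style model whose forward call returns a square grid of patch tokens with no CLS token.

```python
import torch
import torch.nn.functional as F

def s2_features(model, x, scales=(1, 2)):
    """Minimal sketch of Scaling on Scales (S2): run one frozen backbone
    on several image scales and concatenate the pooled features.

    model:  callable mapping images (B, C, H, W) to patch features (B, N, D),
            where N is a square number (assumption: no CLS token).
    x:      batch of images at the base resolution (B, C, H, W).
    scales: integer up-scaling factors; scale s yields s*s base-size crops.
    """
    B, C, H, W = x.shape
    per_scale = []
    for s in scales:
        # Interpolate the image to s times the base resolution.
        xs = F.interpolate(x, size=(H * s, W * s), mode="bilinear",
                           align_corners=False)
        # Split the enlarged image into s*s non-overlapping base-size crops.
        crops = [xs[:, :, i * H:(i + 1) * H, j * W:(j + 1) * W]
                 for i in range(s) for j in range(s)]
        crops = torch.cat(crops, dim=0)            # (s*s*B, C, H, W)
        feats = model(crops)                       # (s*s*B, N, D)
        n = int(feats.shape[1] ** 0.5)             # assume square patch grid
        D = feats.shape[-1]
        # Reassemble crop features into one large spatial map per image,
        # then average-pool back down to the base n x n feature resolution.
        feats = feats.reshape(s, s, B, n, n, D).permute(2, 5, 0, 3, 1, 4)
        feats = feats.reshape(B, D, s * n, s * n)
        feats = F.adaptive_avg_pool2d(feats, output_size=n)
        per_scale.append(feats.flatten(2).transpose(1, 2))  # (B, n*n, D)
    # Concatenate across scales along the channel dimension.
    return torch.cat(per_scale, dim=-1)            # (B, n*n, D * len(scales))
```

Under these assumptions, a call like `s2_features(vit.forward_features, images, scales=(1, 2))` keeps the token count fixed while multiplying the feature dimension by the number of scales, which is the general shape of the multi-scale features the paper describes.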