16 Jan 2024 | Alaaeldin El-Nouby, Michal Klein, Shuangfei Zhai, Miguel Angel Bautista, Alexander Toshev, Vaishal Shankar, Joshua M Susskind, Armand Joulin*
This paper introduces AIM (Autoregressive Image Models), a collection of vision models pre-trained with an autoregressive objective, inspired by Large Language Models (LLMs). The key findings are that visual feature performance scales with both model capacity and data quantity, and that the value of the objective function correlates with downstream task performance. AIM is pre-trained on 2 billion images and achieves 84.0% accuracy on ImageNet-1k with a frozen trunk, showing no signs of saturation even at this scale. The pre-training process is similar to that of LLMs and does not require any specific stability-inducing techniques. The paper explores the impact of scaling, dataset choice, and architectural modifications, demonstrating strong scaling behavior and consistent improvement in downstream performance. AIM outperforms state-of-the-art methods across 15 image recognition benchmarks, narrowing the gap between generative and joint embedding approaches. The paper also discusses limitations and future directions, highlighting the potential for further improvements with larger models and longer training schedules.
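To make the autoregressive pre-training objective concrete, below is a minimal, hypothetical sketch (not the authors' code): an image is split into a raster-ordered sequence of patches, a causal Transformer processes the sequence, and each position regresses the normalized pixel values of the following patch. Module names, model sizes, and the omission of positional embeddings and AIM's prefix attention are simplifications assumed for illustration.

```python
import torch
import torch.nn as nn


def patchify(images, patch_size=16):
    """Split a batch of images (B, C, H, W) into flattened patches (B, N, C*p*p)."""
    B, C, H, W = images.shape
    p = patch_size
    patches = images.unfold(2, p, p).unfold(3, p, p)          # (B, C, H/p, W/p, p, p)
    return patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)


class NextPatchRegressor(nn.Module):
    """Causal Transformer trunk that predicts the normalized pixels of the next patch.

    Hypothetical sketch of an AIM-style objective; positional embeddings and
    prefix attention are omitted for brevity.
    """

    def __init__(self, patch_dim, d_model=768, n_layers=12, n_heads=12):
        super().__init__()
        self.embed = nn.Linear(patch_dim, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, 4 * d_model, batch_first=True, norm_first=True
        )
        self.trunk = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, patch_dim)

    def forward(self, patches):
        # Per-patch normalization of the regression targets.
        mean = patches.mean(dim=-1, keepdim=True)
        std = patches.std(dim=-1, keepdim=True) + 1e-6
        targets = (patches - mean) / std

        x = self.embed(targets)
        N = x.size(1)
        # Causal mask: position i can only attend to patches <= i.
        causal_mask = torch.triu(
            torch.ones(N, N, dtype=torch.bool, device=x.device), diagonal=1
        )
        h = self.trunk(x, mask=causal_mask)
        preds = self.head(h)

        # Autoregressive loss: the prediction at position i targets patch i+1.
        return nn.functional.mse_loss(preds[:, :-1], targets[:, 1:])


# Usage (illustrative):
# images = torch.randn(8, 3, 224, 224)
# loss = NextPatchRegressor(patch_dim=3 * 16 * 16)(patchify(images))
```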