Scalable Pre-training of Large Autoregressive Image Models

16 Jan 2024 | Alaaeldin El-Nouby, Michal Klein, Shuangfei Zhai, Miguel Angel Bautista, Alexander Toshev, Vaishaal Shankar, Joshua M. Susskind, Armand Joulin
AIM is a collection of vision models pre-trained with an autoregressive objective, inspired by large language models (LLMs). The models exhibit similar scaling properties: downstream performance improves as model capacity and the quantity of pre-training data increase, and the value of the pre-training objective correlates with downstream task performance. The largest (7-billion-parameter) AIM achieves 84.0% top-1 accuracy on ImageNet-1k with a frozen trunk, showing no signs of saturation.

AIM uses a Vision Transformer (ViT) architecture with two modifications for autoregressive pre-training: a prefix attention mechanism, which lets the model be used with bidirectional attention on downstream tasks and improves downstream performance, and a parameterized token-level prediction head, which improves the quality of the learned features. Pre-training closely mirrors LLM pre-training and requires no image-specific strategies or stability-inducing techniques. The models are pre-trained on the DFN-2B dataset of 2 billion images; the training objective is to predict image patches in a fixed sequence using a normalized pixel-level regression loss.
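The sketch below is a minimal, illustrative PyTorch rendering of the two pre-training ingredients described above: a prefix-causal attention mask and a normalized pixel-level regression loss over image patches. It is not the authors' implementation; the function names, the patch size, and the fixed prefix length are assumptions, and details such as sampling the prefix length and excluding prefix patches from the loss are omitted.

```python
# Illustrative sketch of AIM-style pre-training components (not the authors' code):
# a prefix-causal attention mask and a normalized pixel-level regression loss.
import torch
import torch.nn.functional as F


def prefix_causal_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    """Boolean attention mask: True = may attend.

    Patches inside the prefix attend bidirectionally to each other; the
    remaining patches attend causally (only to earlier positions).
    """
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    mask[:, :prefix_len] = True  # every position can see the full prefix
    return mask


def patchify(images: torch.Tensor, patch_size: int = 14) -> torch.Tensor:
    """Split (B, C, H, W) images into a raster-order sequence of flattened patches."""
    b, c, h, w = images.shape
    patches = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * patch_size * patch_size)
    return patches  # (B, N, patch_dim)


def patch_regression_loss(pred: torch.Tensor, patches: torch.Tensor) -> torch.Tensor:
    """Normalized pixel regression: each target patch is standardized by its own
    mean/std, and the model predicts the next patch from the previous ones."""
    mean = patches.mean(dim=-1, keepdim=True)
    std = patches.std(dim=-1, keepdim=True)
    target = (patches - mean) / (std + 1e-6)
    # pred[:, k] is the prediction for patch k+1, so align by shifting one step.
    return F.mse_loss(pred[:, :-1], target[:, 1:])


if __name__ == "__main__":
    images = torch.randn(2, 3, 224, 224)
    patches = patchify(images)                      # (2, 256, 588)
    mask = prefix_causal_mask(patches.shape[1], prefix_len=16)
    pred = torch.randn_like(patches)                # stand-in for transformer output
    print(mask.shape, patch_regression_loss(pred, patches).item())
```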
AIM's performance scales with both model size and training data: pre-training on larger datasets improves downstream performance, with no signs of saturation. Evaluated on 15 image recognition benchmarks, AIM outperforms other methods and achieves strong results with a frozen trunk. It is also compatible with low-rank adaptation (LoRA), which gives significant improvements over frozen-trunk evaluation. Overall, the autoregressive objective proves effective for visual feature learning: AIM is competitive with state-of-the-art methods, and the results suggest that large-scale vision models benefit from autoregressive pre-training, with potential for further gains from larger models and longer pre-training schedules.
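As a hedged illustration of how low-rank adaptation can be combined with a frozen trunk, the sketch below wraps a single frozen linear layer with trainable low-rank factors. The class name, rank, and scaling are assumptions for demonstration and do not reflect the paper's exact setup.

```python
# Sketch of LoRA on a frozen linear layer, as a stand-in for adapting a frozen
# AIM trunk; hyperparameters here are illustrative, not taken from the paper.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear and adds a trainable low-rank update B @ A."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # keep the pre-trained weights frozen
            p.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen projection plus the low-rank correction.
        return self.base(x) + self.scale * (x @ self.lora_a.T) @ self.lora_b.T


if __name__ == "__main__":
    frozen_proj = nn.Linear(1024, 1024)           # stand-in for one attention projection
    adapted = LoRALinear(frozen_proj, rank=8)
    x = torch.randn(4, 1024)
    y = adapted(x)                                # equals frozen_proj(x) at init (B = 0)
    trainable = sum(p.numel() for p in adapted.parameters() if p.requires_grad)
    print(y.shape, trainable)                     # only the LoRA factors are trainable
```

Because the low-rank update starts at zero, the adapted model initially reproduces the frozen-trunk behavior and only the small LoRA factors are updated during fine-tuning, which is what makes this a lightweight alternative to full fine-tuning.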