Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction

10 Jun 2024 | Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, Liwei Wang
Visual AutoRegressive (VAR) modeling is a new approach to image generation that redefines autoregressive learning on images as coarse-to-fine "next-scale prediction" rather than raster-order "next-token prediction." This paradigm allows autoregressive (AR) transformers to learn visual distributions efficiently and generalize well. VAR significantly improves on prior AR baselines on the ImageNet 256×256 benchmark, achieving a Fréchet inception distance (FID) of 1.73 and an inception score (IS) of 350.2, with roughly 20× faster inference. It surpasses the Diffusion Transformer (DiT) along multiple dimensions, including image quality, inference speed, data efficiency, and scalability. VAR also demonstrates zero-shot generalization to downstream tasks such as image in-painting, out-painting, and editing. These results suggest that VAR has begun to emulate two important properties of large language models (LLMs): scaling laws and zero-shot generalization. All models and code are released to promote the exploration of AR/VAR models for visual generation and unified learning.
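
To make the paradigm shift concrete, below is a minimal, runnable sketch of next-scale generation: at each step the model predicts an entire token map at the next, larger resolution in a single forward pass, conditioned on all coarser maps, instead of emitting tokens one by one in raster order. The `ToyVAR` model, the `SCALES` schedule, and the greedy decoding are illustrative assumptions for this sketch, not the authors' released implementation.

```python
# Minimal sketch of VAR-style "next-scale prediction".
# All names, the scale schedule, and the toy model are illustrative
# assumptions; the real VAR uses a multi-scale VQ tokenizer and a
# full causal transformer.
import torch
import torch.nn as nn

SCALES = [1, 2, 4, 8, 16]   # side lengths of token maps, coarse to fine (assumed)
VOCAB = 4096                # VQ codebook size (assumed)
DIM = 64                    # tiny embedding width for the toy model

class ToyVAR(nn.Module):
    """Stand-in for the AR transformer: embeds all tokens generated so
    far and predicts logits for every position of the next scale."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens, next_len):
        # Toy "context": pool everything generated so far, then broadcast
        # it to all next_len positions so they are predicted in parallel.
        h = self.embed(tokens).mean(dim=1, keepdim=True)  # (B, 1, DIM)
        return self.head(h.expand(-1, next_len, -1))      # (B, next_len, VOCAB)

@torch.no_grad()
def generate(model, start_token=0):
    tokens = torch.tensor([[start_token]])   # conditioning token (e.g. a class id)
    maps = []
    for s in SCALES:
        logits = model(tokens, s * s)        # one pass predicts the whole s×s map
        nxt = logits.argmax(-1)              # greedy here for brevity; VAR samples
        maps.append(nxt.view(1, s, s))
        tokens = torch.cat([tokens, nxt], dim=1)  # coarser maps stay in context
    return maps  # in VAR these maps would feed a multi-scale VQ decoder

maps = generate(ToyVAR())
print([tuple(m.shape) for m in maps])  # [(1,1,1), (1,2,2), ..., (1,16,16)]
```

Because each scale is decoded in one forward pass, the number of autoregressive steps equals the number of scales rather than the number of tokens, which is the source of the large inference speedup over token-by-token AR models.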