10 Jun 2024 | Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, Liwei Wang
The paper introduces Visual AutoRegressive (VAR) modeling, a novel approach to image generation that redefines autoregressive learning on images as "next-scale prediction" or "next-resolution prediction," diverging from the standard raster-scan "next-token prediction." This methodology allows autoregressive (AR) transformers to learn visual distributions more efficiently and generalize well. VAR models achieve significant improvements over existing AR and diffusion models in terms of image quality, inference speed, data efficiency, and scalability. On the ImageNet 256×256 benchmark, VAR reduces the Fréchet inception distance (FID) from 18.65 to 1.73 and improves the inception score (IS) from 80.4 to 350.2, with a 20× faster inference speed. VAR also demonstrates zero-shot generalization in tasks such as image in-painting, out-painting, and editing. The paper provides a comprehensive open-source code suite to promote the exploration of AR/VAR models for visual generation and unified learning.
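The core idea of next-scale prediction can be sketched in a few lines: instead of emitting tokens one at a time in raster order, each autoregressive step predicts the entire token map at the next (higher) resolution, conditioned on all coarser maps produced so far. The sketch below is a minimal illustration, not the paper's implementation; the predictor stub and the codebook size of 4096 are assumptions standing in for the VAR transformer.

```python
import random

def next_scale_prediction(scales, predict):
    """Coarse-to-fine autoregression: each step predicts the whole
    token map at the next resolution, conditioned on all coarser maps."""
    history = []
    for h, w in scales:
        # All h*w tokens of this scale are predicted in parallel,
        # given every previously generated (coarser) token map.
        history.append(predict(history, h, w))
    return history

# Hypothetical stand-in for the VAR transformer (codebook size assumed 4096):
def dummy_predict(history, h, w):
    rng = random.Random(len(history))
    return [[rng.randrange(4096) for _ in range(w)] for _ in range(h)]

maps = next_scale_prediction([(1, 1), (2, 2), (4, 4), (8, 8)], dummy_predict)
print([(len(m), len(m[0])) for m in maps])  # → [(1, 1), (2, 2), (4, 4), (8, 8)]
```

Because each scale is predicted in one forward pass rather than token by token, the number of autoregressive steps grows with the number of scales instead of the number of tokens, which is the source of the inference-speed advantage the paper reports.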