10 Jun 2024 | Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, Zehuan Yuan
LlamaGen is a new family of image generation models that applies the "next-token prediction" paradigm from large language models (LLMs) to the visual domain. The work demonstrates that vanilla autoregressive models such as Llama, without any visual inductive biases, can achieve state-of-the-art image generation performance when scaled appropriately. The research examines image tokenizers, the scalability of image generation models, and training data quality. Key contributions include:

(1) An image tokenizer with a 16× downsample ratio, achieving 0.94 rFID and 97% codebook usage on ImageNet.
(2) A series of class-conditional image generation models ranging from 111M to 3.1B parameters, achieving 2.18 FID on the ImageNet 256×256 benchmark and outperforming diffusion models such as LDM and DiT.
(3) A text-conditional image generation model with 775M parameters, trained on LAION-COCO and high-aesthetic-quality images, demonstrating competitive visual quality and text alignment.
(4) A 326%–414% inference speedup using vLLM, a popular LLM serving framework.

All models and code are released to support the open-source community of visual generation and multimodal foundation models. The work shows that autoregressive models can serve as a foundation for image generation systems, with room for further improvement given more training data and compute.
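For readers new to the idea, here is a minimal, hypothetical PyTorch sketch (not the released LlamaGen code) of how "next-token prediction" maps onto images: a tokenizer with a 16× downsample ratio turns a 256×256 image into a 16×16 = 256-token grid, and a Llama-style decoder-only transformer predicts those tokens one at a time from a class condition. The module names, codebook size, and model dimensions below are illustrative assumptions.

```python
# Hypothetical sketch of class-conditional next-token image generation.
import torch
import torch.nn as nn

IMAGE_SIZE = 256
DOWNSAMPLE = 16                     # tokenizer downsample ratio from the paper
GRID = IMAGE_SIZE // DOWNSAMPLE     # 16, so 16x16 = 256 image tokens
SEQ_LEN = GRID * GRID
VOCAB = 16384                       # assumed codebook size, for illustration only
NUM_CLASSES = 1000                  # ImageNet class conditioning

class TinyARImageModel(nn.Module):
    """Minimal decoder-only stand-in for a Llama-style image generator."""
    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        self.cls_embed = nn.Embedding(NUM_CLASSES, dim)
        self.tok_embed = nn.Embedding(VOCAB, dim)
        self.pos_embed = nn.Embedding(SEQ_LEN + 1, dim)
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, VOCAB)

    def forward(self, class_id, tokens):
        # Sequence = [class token] + previously generated image tokens.
        x = torch.cat([self.cls_embed(class_id)[:, None],
                       self.tok_embed(tokens)], dim=1)
        x = x + self.pos_embed(torch.arange(x.size(1), device=x.device))
        # Causal mask: each position may attend only to earlier positions.
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        return self.head(self.blocks(x, mask=mask))   # next-token logits

@torch.no_grad()
def sample(model, class_id, temperature=1.0):
    tokens = torch.empty(1, 0, dtype=torch.long)
    for _ in range(SEQ_LEN):                       # one image token per step
        logits = model(class_id, tokens)[:, -1] / temperature
        next_tok = torch.multinomial(logits.softmax(-1), 1)
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens.view(1, GRID, GRID)              # 16x16 grid for the tokenizer decoder

model = TinyARImageModel()
grid = sample(model, torch.tensor([207]))          # e.g. ImageNet class 207
print(grid.shape)                                  # torch.Size([1, 16, 16])
```

Because the generator is a standard causal transformer over discrete tokens, it can reuse LLM infrastructure largely unchanged, which is what makes the reported vLLM serving speedup possible.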