10 Jun 2024 | Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, Zehuan Yuan
The paper introduces LlamaGen, a new family of image generation models that apply the "next-token prediction" paradigm from large language models to the visual domain. The authors reexamine the design spaces of image tokenizers, scalability properties of image generation models, and the quality of training data. Key contributions include:
1. **Image Tokenizer**: An image tokenizer with a downsample ratio of 16, achieving a reconstruction quality of 0.94 rFID and codebook usage of 97% on the ImageNet benchmark.
2. **Class-Conditional Image Generation Models**: A series of models ranging from 111M to 3.1B parameters, achieving 2.18 FID on the ImageNet 256x256 benchmark and outperforming popular diffusion models such as LDM and DiT (a minimal sketch of this next-token generation setup follows the list).
3. **Text-Conditional Image Generation Model**: A 775M-parameter model trained on a subset of LAION-COCO and fine-tuned on high-aesthetic-quality images, demonstrating competitive performance in visual quality and text alignment.
4. **Optimized Inference Speed**: Using the vLLM serving framework, the authors report a 326% to 414% speedup in inference.
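To make the next-token-prediction setup concrete, below is a minimal sketch of class-conditional autoregressive generation over a 16x16 grid of VQ codebook indices, which is what a 256x256 image becomes at downsample ratio 16. The tiny transformer, the codebook size, and the sampling loop are illustrative assumptions for exposition, not LlamaGen's actual architecture or released code; in the real pipeline the sampled token grid is passed to the image tokenizer's decoder to reconstruct pixels.

```python
import torch
import torch.nn as nn

# Illustrative sizes only; these are assumptions for exposition, not LlamaGen's real config.
CODEBOOK_SIZE = 16384   # vocabulary of discrete image tokens from the VQ tokenizer (assumed)
NUM_CLASSES = 1000      # ImageNet class labels
GRID = 16               # 256x256 image at downsample ratio 16 -> 16x16 = 256 tokens
SEQ_LEN = GRID * GRID

class TinyARImageModel(nn.Module):
    """Minimal decoder-only transformer that predicts image tokens one at a time."""
    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        self.class_emb = nn.Embedding(NUM_CLASSES, dim)   # class label acts as the prefix token
        self.tok_emb = nn.Embedding(CODEBOOK_SIZE, dim)
        self.pos_emb = nn.Embedding(SEQ_LEN + 1, dim)
        block = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, CODEBOOK_SIZE)

    def forward(self, class_ids, token_ids):
        # Sequence = [class embedding] + [image tokens generated so far].
        x = torch.cat([self.class_emb(class_ids)[:, None], self.tok_emb(token_ids)], dim=1)
        x = x + self.pos_emb(torch.arange(x.size(1), device=x.device))
        causal = torch.triu(torch.full((x.size(1), x.size(1)), float("-inf")), diagonal=1)
        x = self.blocks(x, mask=causal)
        return self.head(x)                    # logits over the codebook at every position

@torch.no_grad()
def sample(model, class_id, temperature=1.0):
    """Generate 16x16 = 256 image tokens left to right; a VQ decoder would turn them into pixels."""
    tokens = torch.empty(1, 0, dtype=torch.long)
    cls = torch.tensor([class_id])
    for _ in range(SEQ_LEN):
        logits = model(cls, tokens)[:, -1] / temperature
        next_tok = torch.multinomial(logits.softmax(-1), num_samples=1)
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens.view(1, GRID, GRID)          # token grid to hand to the tokenizer's decoder

model = TinyARImageModel().eval()
token_grid = sample(model, class_id=207)       # 207 = an arbitrary ImageNet class index
print(token_grid.shape)                        # torch.Size([1, 16, 16])
```

The point of the sketch is the shape arithmetic: a downsample ratio of 16 turns generation of a 256x256 image into a 256-step next-token prediction problem, which is exactly the regime where standard LLM architectures and serving stacks apply.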
The paper also discusses the effectiveness of LLM serving frameworks in optimizing inference speed and releases all models and code to support the open-source community around visual generation and multimodal foundation models. Overall, the work demonstrates that vanilla autoregressive models can achieve state-of-the-art image generation performance when properly scaled and optimized.
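On the serving side, the reported gains come from standard LLM inference optimizations (vLLM's paged KV cache and continuous batching) rather than from model changes. The snippet below is a generic vLLM offline-inference call showing the serving API the authors build on; the checkpoint path and prompts are hypothetical placeholders, and the actual LlamaGen integration adapts vLLM to its image-token checkpoints and detokenizes the sampled tokens, which is not shown here.

```python
# Generic vLLM offline-inference sketch, NOT the authors' exact image-generation setup.
# The model path and prompts below are hypothetical placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="path/to/llama-style-checkpoint")   # hypothetical checkpoint path
params = SamplingParams(temperature=1.0, top_p=1.0, max_tokens=256)  # 256 tokens ~ one 16x16 grid

# Requests are batched and scheduled continuously; together with the paged KV cache,
# this is where most of the serving-side speedup comes from.
outputs = llm.generate(["a photo of a golden retriever", "a snowy mountain at dawn"], params)
for out in outputs:
    print(len(out.outputs[0].token_ids), "tokens generated")
```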