16 Jun 2024
**STAR: SCALE-WISE TEXT-TO-IMAGE GENERATION VIA AUTO-REGRESSIVE REPRESENTATIONS**
**Authors:** Xiaoxiao Ma, Mohan Zhou, Tao Liang, Yalong Bai, Tiejun Zhao, Huai'an Chen, Yi Jin
**Abstract:**
STAR is a text-to-image model that employs a scale-wise auto-regressive paradigm, enabling open-set generation through three key designs: a pre-trained text encoder for extracting textual constraints, additional cross-attention layers for improved interaction between generated images and fine-grained textual guidance, and a normalized 2D Rotary Positional Encoding (RoPE) for a consistent interpretation of relative positions across scales. Extensive experiments demonstrate that STAR outperforms existing methods on standard benchmarks in fidelity, image-text consistency, and aesthetic quality, highlighting the potential of auto-regressive models for high-quality image synthesis.
**Introduction:**
Text-to-Image (T2I) generation has emerged as a major trend in computer vision, allowing individuals to create realistic and imaginative images. Existing T2I models based on VAEs, GANs, and diffusion models face limitations in diversity, controllability, or efficiency. Auto-regressive (AR) models such as PixelRNN and PixelCNN model the image distribution directly, but their next-token prediction over flattened patches conflicts with the bi-directional, 2D structural correlations among image patch tokens. STAR revisits the "next-scale prediction" mechanism and evolves it into a general open-set T2I model.
**Method:**
- **Next-scale Prediction:** STAR predicts discrete latent-space token maps scale by scale, from coarse to fine, using pooled text features and cross-attention for text guidance (see the sampling sketch after this list).
- **Normalized RoPE:** Normalizes token positions by each scale's resolution so that relative positions are interpreted consistently across scales, which stabilizes training (sketched in code after this list).
- **Efficient Textual Guidance:** Utilizes pooled text features as start tokens and injects cross-attention mechanisms for detailed textual guidance at each scale.
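
To make the scale-wise pipeline concrete, here is a minimal sampling sketch in PyTorch. This is a simplified illustration under assumed interfaces, not the paper's code: `transformer`, `tokenizer` (the multi-scale VQ tokenizer), `text_encoder`, and the `context`/`cur_hw` arguments are hypothetical stand-ins, and the scale schedule is illustrative.

```python
import torch

@torch.no_grad()
def sample_star(transformer, tokenizer, text_encoder, prompt,
                scales=((1, 1), (2, 2), (4, 4), (8, 8), (16, 16), (32, 32))):
    # The pooled text feature acts as the start token; per-token text features
    # feed the cross-attention layers injected at every scale.
    pooled, text_ctx = text_encoder(prompt)        # (1, D), (1, L, D)

    seq = pooled.unsqueeze(1)                      # running AR input, (1, 1, D)
    token_maps = []
    for h, w in scales:
        # All h*w tokens of the current scale are predicted in parallel,
        # conditioned on every coarser scale (causal across scales only).
        logits = transformer(seq, context=text_ctx, cur_hw=(h, w))  # (1, h*w, V)
        ids = logits.argmax(dim=-1)                # greedy decoding for brevity
        token_maps.append(ids.view(1, h, w))

        # Embed the new map and append it so the next scale can attend to it.
        emb = tokenizer.embed(ids)                 # (1, h*w, D)
        seq = torch.cat([seq, emb], dim=1)

    # The multi-scale tokenizer decodes the accumulated maps into an image.
    return tokenizer.decode(token_maps)
```

In practice a multi-scale tokenizer in the VAR lineage propagates upsampled residual features between scales rather than raw embeddings; the loop above compresses that detail into `tokenizer.embed`, and real sampling would use top-k/top-p rather than argmax.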
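The normalized RoPE can likewise be sketched directly. The idea is to divide each token's row/column index by the current scale's height/width, so the same relative offset produces the same rotation at every scale. The self-contained function below rotates half of the channels by the normalized row position and the other half by the column position; the frequency schedule and channel split are assumptions, not the paper's exact implementation.

```python
import torch

def normalized_2d_rope(x: torch.Tensor, H: int, W: int, base: float = 10000.0):
    """x: (B, H*W, D) token features; D must be divisible by 4
    (half the channels rotate with the row index, half with the column)."""
    B, N, D = x.shape
    assert N == H * W and D % 4 == 0
    d = D // 2  # channels assigned to each axis

    rows = torch.arange(H, dtype=torch.float32) / H   # normalized row coords
    cols = torch.arange(W, dtype=torch.float32) / W   # normalized col coords
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)

    # Per-axis rotation angles, broadcast over the 2D grid, flattened to H*W.
    ang_r = (rows[:, None] * freqs)[:, None, :].expand(H, W, d // 2)
    ang_c = (cols[:, None] * freqs)[None, :, :].expand(H, W, d // 2)
    ang = torch.cat([ang_r.reshape(N, -1), ang_c.reshape(N, -1)], dim=-1)

    # Standard RoPE rotation applied to consecutive channel pairs.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = ang.cos(), ang.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

feats = torch.randn(2, 16 * 16, 64)
feats = normalized_2d_rope(feats, H=16, W=16)  # same call works at any scale
```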
**Experiments:**
- **Model Parameters:** STAR adopts a decoder-only transformer architecture with approximately 1.7 billion parameters.
- **Datasets:** Training datasets include JourneyDB, LAION-HD, and LAION-Art.
- **Training Details:** The transformer is trained to predict the concatenated token maps of all scales, with normalized RoPE keeping positions consistent across scales (see the attention-mask sketch after this list).
- **Performance Comparisons:** STAR achieves superior performance in FID, CLIP-Score, and human preferences, generating high-quality 512x512 images with stunning details in approximately 2.9 seconds.
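
For the training objective referenced above, the attention pattern implied by next-scale prediction is block-causal: each token attends freely within its own scale and to all coarser scales, never to finer ones. The helper below builds such a mask; it is an assumption based on the scale-wise AR formulation (as in VAR), not code from the paper.

```python
import torch

def block_causal_mask(scale_sizes):
    """scale_sizes: tokens per scale, coarsest first, e.g. [1, 4, 16, 64]."""
    # scale_id[t] = index of the scale that token t belongs to
    scale_id = torch.repeat_interleave(
        torch.arange(len(scale_sizes)), torch.tensor(scale_sizes))
    # Attention allowed where the query's scale is the same or finer (>=)
    # than the key's scale: within-scale plus all coarser scales.
    return scale_id[:, None] >= scale_id[None, :]   # (N, N) bool mask

print(block_causal_mask([1, 4, 16]).int())
```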
**Future Works:**
- Higher resolutions: Developing more efficient training strategies for larger scale generation.
- More efficient sampling strategies: Improving diversity and detail richness in generated images.
- Downstream tasks: Exploring the application of STAR in controllable generation and image editing.
**Conclusion:**
STAR introduces a new auto-regressive paradigm for efficient text-to-image synthesis, achieving superior performance in fidelity, text-image alignment, and aesthetic quality.