STAR: SCALE-WISE TEXT-TO-IMAGE GENERATION VIA AUTO-REGRESSIVE REPRESENTATIONS


2024-06-16 | Xiaoxiao Ma, Mohan Zhou, Tao Liang, Yalong Bai, Tiejun Zhao, Huaian Chen, Yi Jin
STAR is a text-to-image generation model that employs a scale-wise auto-regressive paradigm. Unlike VAR, which is limited to class-conditioned synthesis within a fixed set of predetermined categories, STAR enables text-driven open-set generation through three key designs, each sketched below. First, to boost diversity and generalization to unseen combinations of objects and concepts, a pre-trained text encoder extracts representations of the textual constraints, which then serve as guidance. Second, to strengthen the interaction between generated images and fine-grained textual guidance, additional cross-attention layers are incorporated at each scale. Third, exploiting the natural structural correlation across scales, 2D Rotary Positional Encoding (RoPE) is adopted and modified into a normalized version, ensuring a consistent interpretation of relative positions across token maps of different sizes and stabilizing training. Extensive experiments demonstrate that STAR surpasses existing benchmarks in fidelity, image-text consistency, and aesthetic quality.
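The scale-wise paradigm predicts an entire token map per step, coarse to fine, rather than one token at a time. Below is a minimal sampling-loop sketch under that interpretation; `transformer`, `vq_decoder`, and the scale schedule are hypothetical stand-ins, and VAR-style residual accumulation across scales is omitted for brevity.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_scalewise(transformer, vq_decoder, text_emb, scales=(1, 2, 4, 8, 16, 32)):
    """Coarse-to-fine sampling: at each scale, predict a whole token map
    conditioned on the text features and all previously generated scales.
    Simplified sketch; omits residual accumulation and classifier-free guidance."""
    prefix = []                                       # token maps from coarser scales
    for side in scales:
        logits = transformer(prefix, text_emb, side)  # (1, side*side, vocab)
        probs = F.softmax(logits, dim=-1)
        tokens = torch.multinomial(probs.view(-1, probs.size(-1)), 1)
        prefix.append(tokens.view(1, side, side))
    return vq_decoder(prefix[-1])                     # decode finest map to pixels
```

Each scale is generated in one forward pass over its whole token map, which is where the speed advantage over token-by-token decoding comes from.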
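For the text conditioning, a plausible reading is a standard decoder block augmented with a cross-attention layer over features from a frozen pre-trained text encoder. `ScaleBlock` and its exact layout are illustrative, not the paper's verified architecture:

```python
import torch.nn as nn

class ScaleBlock(nn.Module):
    """Causal self-attention over image tokens, followed by cross-attention
    into text-encoder features (e.g., from a frozen CLIP text tower)."""
    def __init__(self, dim, heads):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, text_emb, attn_mask=None):
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, attn_mask=attn_mask)[0]
        h = self.norm2(x)
        x = x + self.cross_attn(h, text_emb, text_emb)[0]  # fine-grained text guidance
        return x + self.mlp(self.norm3(x))
```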
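The motivation for normalizing RoPE: with absolute token indices, an offset of one token spans a quarter of a 4×4 map but 1/64 of a 64×64 map, so the same rotation angle means different spatial relationships at different scales. Normalizing positions by the map's side length makes the angle a function of fractional position instead. A sketch under that assumption (the paper's exact parameterization may differ):

```python
import torch

def normalized_2d_rope_angles(side, dim, base=10000.0):
    """Rotation angles for a side x side token map, with coordinates
    normalized to [0, 1) so a given fractional offset maps to the same
    angle at every scale. Half the rotary channels encode rows, half columns."""
    freqs = base ** (-torch.arange(dim // 4) / (dim // 4))  # (dim//4,)
    pos = torch.arange(side) / side                         # normalized coordinates
    ang = torch.outer(pos, freqs)                           # (side, dim//4)
    row = ang[:, None, :].expand(side, side, -1)            # row angle per token
    col = ang[None, :, :].expand(side, side, -1)            # column angle per token
    return torch.cat([row, col], dim=-1).reshape(side * side, dim // 2)

def apply_rope(x, ang):
    """Rotate consecutive feature pairs of x (seq, dim) by the given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = ang.cos(), ang.sin()
    return torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1).flatten(-2)
```

With this normalization, `normalized_2d_rope_angles(4, d)` and `normalized_2d_rope_angles(8, d)` assign the same angle to tokens at the same fractional position, which is what keeps relative positions consistent across scales.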
STAR offers a promising new direction for the T2I field currently dominated by diffusion methods. It generates images with rich visual details, such as animal hair, plant leaves, and facial features, while demonstrating remarkable fine-grained alignment with textual guidance. Its main contributions are a novel auto-regressive model for open-set text-to-image generation; the use of pre-trained text-encoder features, injected through cross-attention layers, for detailed textual guidance; and a new normalized RoPE that stabilizes training and ensures a consistent interpretation of relative positions across scales. Built on a standard decoder-only transformer and trained on large-scale datasets, the model is efficient and scalable. STAR outperforms existing methods in fidelity, image-text consistency, and human preference, generating high-quality 512×512 images with fine details in approximately 2.9 seconds. Compared with leading diffusion models, it offers a significant speed advantage while producing more detailed images.