MARS: Mixture of Auto-Regressive Models for Fine-grained Text-to-image Synthesis

11 Jul 2024 | Wanggui He1*, Siming Fu1*, Mushui Liu2*, Xierui Wang2+, Wenyi Xiao2+, Fangxun Shu1+, Yi Wang2, Lei Zhang2, Zhelun Yu3, Haoyuan Li3, Ziwei Huang2+, LeiLei Gan2, Hao Jiang1,†
MARS is a framework for text-to-image (T2I) generation built around a Semantic Vision-Language Integration Expert (SemVIE) module. SemVIE combines a pre-trained Large Language Model (LLM) with visual expertise, preserving the LLM's NLP capabilities while enhancing its visual understanding. MARS generates high-quality, fine-grained images that closely adhere to their textual descriptions and achieves strong results across various benchmarks. Training follows a multi-stage strategy that starts with robust image-text alignment and then refines the T2I generation process. MARS requires only 9% of the GPU days needed by SD1.5, which demonstrates its training efficiency. The model supports both English and Chinese prompts and can render detailed visual features such as animal fur, plant foliage, and facial features. Its flexibility and adaptability make it suitable for a wide range of applications, including any-to-any tasks.
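The summary does not spell out how SemVIE combines the language model with its visual expertise. One plausible reading is per-token routing by modality: text tokens pass through a frozen, pre-trained language expert (which would preserve NLP behavior), while image tokens pass through a separately trained visual expert. The sketch below illustrates only that idea, in plain Python. Every name in it (`Expert`, `route_by_modality`, the `scale` stand-in for weights) is hypothetical and is not taken from the MARS code or paper.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class Expert:
    """Toy per-token transform standing in for one expert's layers (hypothetical)."""
    name: str
    scale: float      # stand-in for learned weights
    trainable: bool   # assumption: the language expert stays frozen

    def __call__(self, x: Vector) -> Vector:
        return [self.scale * v for v in x]


def route_by_modality(tokens: List[Vector], is_image: List[bool],
                      text_expert: Expert, visual_expert: Expert) -> List[Vector]:
    """Send each token to the expert that matches its modality."""
    if len(tokens) != len(is_image):
        raise ValueError("tokens and is_image must have the same length")
    return [
        visual_expert(tok) if img else text_expert(tok)
        for tok, img in zip(tokens, is_image)
    ]


# Assumption: the text expert is frozen, the visual expert is trained.
text_expert = Expert("text", scale=1.0, trainable=False)
visual_expert = Expert("visual", scale=2.0, trainable=True)

# One text token followed by two image tokens.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
is_image = [False, True, True]

routed = route_by_modality(tokens, is_image, text_expert, visual_expert)
trainable_names = [e.name for e in (text_expert, visual_expert) if e.trainable]
```

In this toy setup only the visual expert is marked trainable, so the text path is untouched by training. That is one way the stated goal of preserving the LLM's language ability could be met, not a description of the authors' implementation.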