Understanding MARS: Mixture of Auto-Regressive Models for Fine-grained Text-to-image Synthesis

MARS is a framework for fine-grained text-to-image synthesis. It integrates a specialized Semantic Vision-Language Integration Expert (SemVIE) that adds visual understanding while preserving the natural language processing capabilities of a pre-trained large language model (LLM). The framework builds on a pre-trained Qwen-7B model and supports bilingual generation in English and Chinese. A multi-stage training strategy improves image-text alignment and image generation quality, and the model achieves strong results with significantly fewer computational resources than existing models such as SD1.5. MARS performs well on benchmarks such as MS-COCO and T2I-CompBench and generates detailed, high-resolution images that closely follow their textual descriptions. It also supports joint image-text generation, which points to its potential for any-to-any task adaptability. For example, it has been shown to generate recipes together with corresponding images, and the same ability to produce text and images at once makes it applicable to other multimodal tasks.

MARS: Mixture of Auto-Regressive Models for Fine-grained Text-to-image Synthesis

11 Jul 2024 | Wanggui He, Siming Fu, Mushui Liu, Xierui Wang, Wenyi Xiao, Fangxun Shu, Yi Wang, Lei Zhang, Zhelun Yu, Haoyuan Li, Ziwei Huang, LeiLei Gan, Hao Jiang