LLM4GEN: Leveraging Semantic Representation of LLMs for Text-to-Image Generation

30 Jun 2024 | Mushui Liu*, Yuhang Ma*, Zhen Yang, Zeng Zhao, Xinfeng Zhang, Zhipeng Hu, Bai Liu, Changjie Fan
The paper introduces LLM4GEN, an end-to-end text-to-image generation framework that leverages the semantic representations of Large Language Models (LLMs) to enhance the semantic understanding of diffusion models. The framework includes a Cross-Adapter Module (CAM) that fuses LLM features with those of the original text encoder, improving the alignment between text and images. LLM4GEN is designed to be plug-and-play with existing diffusion models such as SD1.5 and SDXL, requiring only 10% of the training data used by recent methods like ELLA. The paper also introduces DensePrompts, a comprehensive benchmark with over 7,000 dense prompts, and a LAION-refined dataset of 1 million text-image pairs. Experiments show that LLM4GEN significantly improves the semantic alignment of SD1.5 and SDXL, yielding 7.69% and 9.60% gains, respectively, on the color attribute of T2I-CompBench. LLM4GEN outperforms existing state-of-the-art models in sample quality, image-text alignment, and human evaluation. The paper concludes by highlighting the efficiency and performance of LLM4GEN, which reduces training data and computational costs while maintaining superior results.
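To make the Cross-Adapter Module idea concrete, below is a minimal PyTorch sketch of how LLM hidden states could be fused with CLIP text-encoder features via cross-attention before being passed to a diffusion UNet. This is an illustrative assumption, not the paper's actual implementation: the class and parameter names (CrossAdapterSketch, clip_dim, llm_dim), the dimensions, the single attention layer, and the zero-initialized gated residual are all hypothetical design choices.

```python
# Minimal sketch of a cross-adapter-style fusion module (illustrative only).
# All names and dimensions here are assumptions, not taken from the paper.
import torch
import torch.nn as nn


class CrossAdapterSketch(nn.Module):
    """Fuses LLM semantic features into CLIP text-encoder features via
    cross-attention, so the diffusion UNet can consume the fused result
    in place of plain CLIP text embeddings."""

    def __init__(self, clip_dim: int = 768, llm_dim: int = 4096, num_heads: int = 8):
        super().__init__()
        # Project LLM hidden states into the CLIP feature space.
        self.llm_proj = nn.Linear(llm_dim, clip_dim)
        # CLIP tokens attend to the (projected) LLM tokens.
        self.cross_attn = nn.MultiheadAttention(clip_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(clip_dim)
        # Zero-initialized gate: training starts from the unmodified CLIP
        # features, which keeps the adapter plug-and-play for a frozen UNet.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, clip_feats: torch.Tensor, llm_feats: torch.Tensor) -> torch.Tensor:
        # clip_feats: (B, N_clip, clip_dim); llm_feats: (B, N_llm, llm_dim)
        llm_ctx = self.llm_proj(llm_feats)
        attn_out, _ = self.cross_attn(query=clip_feats, key=llm_ctx, value=llm_ctx)
        # Gated residual: inject LLM semantics on top of the CLIP features.
        return self.norm(clip_feats + self.gate * attn_out)


if __name__ == "__main__":
    adapter = CrossAdapterSketch()
    clip_feats = torch.randn(2, 77, 768)   # e.g. CLIP text-encoder output
    llm_feats = torch.randn(2, 128, 4096)  # e.g. LLM last hidden states
    print(adapter(clip_feats, llm_feats).shape)  # torch.Size([2, 77, 768])
```

Because the output has the same shape as the original CLIP embeddings, a module like this can be dropped into SD1.5 or SDXL conditioning without modifying the UNet, which is consistent with the plug-and-play property the paper claims.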