LLM4GEN: Leveraging Semantic Representation of LLMs for Text-to-Image Generation

2024-06-30 | Mushui Liu, Yuhang Ma, Xinfeng Zhang, Zhen Yang, Zeng Zhao, Zhipeng Hu, Bai Liu, Changjie Fan
LLM4GEN is a text-to-image generation framework that leverages the semantic representations of Large Language Models (LLMs) to enhance the performance of diffusion models. The framework introduces a Cross-Adapter Module (CAM) that fuses LLM features with those of the original text encoder, such as CLIP, to strengthen the conditioning signal for generation. LLM4GEN is designed to be compatible with various diffusion models, including SD1.5 and SDXL, and it significantly improves the semantic and image-text alignment of generated images. The framework is evaluated on DensePrompts, a comprehensive benchmark of over 7,000 dense prompts with detailed descriptions. Additionally, a LAION-refined dataset is introduced as a large-scale training set for text-to-image generation. Experiments show that LLM4GEN outperforms existing state-of-the-art models in sample quality, image-text alignment, and human evaluation. The framework handles complex, dense prompts well and reduces the training data and computational resources required by diffusion models. By enriching the representation of the text encoder and improving image-text alignment, LLM4GEN offers a promising approach to text-to-image generation.
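The summary above does not include implementation details of the Cross-Adapter Module. The following is a minimal PyTorch sketch of how LLM features might be fused with CLIP text features via cross-attention before being passed to a diffusion UNet; the class name `CrossAdapterModule`, the dimensions, and the residual-plus-concatenation fusion are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossAdapterModule(nn.Module):
    """Illustrative sketch (not the paper's code): fuse LLM semantic
    features with CLIP text-encoder features via cross-attention to
    produce an enriched conditioning sequence for a diffusion UNet."""

    def __init__(self, llm_dim: int = 4096, clip_dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Project high-dimensional LLM hidden states down to CLIP width.
        self.llm_proj = nn.Linear(llm_dim, clip_dim)
        # CLIP tokens act as queries; projected LLM tokens as keys/values.
        self.cross_attn = nn.MultiheadAttention(clip_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(clip_dim)

    def forward(self, clip_feats: torch.Tensor, llm_feats: torch.Tensor) -> torch.Tensor:
        # clip_feats: (B, N_clip, clip_dim); llm_feats: (B, N_llm, llm_dim)
        llm_kv = self.llm_proj(llm_feats)
        attn_out, _ = self.cross_attn(query=clip_feats, key=llm_kv, value=llm_kv)
        fused = self.norm(clip_feats + attn_out)  # residual fusion
        # Keep the unmodified CLIP sequence alongside the fused tokens so
        # the UNet still receives the original conditioning.
        return torch.cat([clip_feats, fused], dim=1)  # (B, 2*N_clip, clip_dim)


# Usage: pass the output as `encoder_hidden_states` to an SD1.5-style UNet.
cam = CrossAdapterModule()
clip_feats = torch.randn(2, 77, 768)    # e.g. CLIP ViT-L/14 text tokens
llm_feats = torch.randn(2, 128, 4096)   # e.g. hidden states from an LLM
cond = cam(clip_feats, llm_feats)       # (2, 154, 768)
```

In the paper's setting, the fused sequence would stand in for the plain CLIP conditioning in SD1.5 or SDXL; the exact fusion mechanism and feature dimensions are design details of LLM4GEN not reproduced here.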