ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

8 Mar 2024 | Xiwei Hu*, Rui Wang*, Yixiao Fang*, Bin Fu, Pei Cheng, and Gang Yu**
ELLA enhances text-to-image diffusion models by equipping them with a powerful Large Language Model (LLM) without training either the U-Net or the LLM. It introduces a Timestep-Aware Semantic Connector (TSC) that dynamically extracts timestep-dependent conditions from the LLM, adapting semantic features to the different stages of the denoising process. This lets the diffusion model interpret long, complex prompts and improves its text alignment and prompt following. The TSC is lightweight, and ELLA can be readily integrated with community models and downstream tools to improve their performance.

To evaluate dense prompt following, the authors introduce the Dense Prompt Graph Benchmark (DPG-Bench), a dataset of 1,065 dense prompts. Extensive experiments show that ELLA outperforms state-of-the-art models in dense prompt following, particularly for compositions involving multiple objects with diverse attributes and relationships. User studies point the same way, indicating that ELLA generates high-quality images that align with complex prompts.
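The idea of a timestep-dependent condition can be illustrated with a small sketch. The summary does not give the TSC architecture, so everything below is an assumption made for illustration and not the authors' implementation: a fixed set of query vectors (random here, learned in a real system) is shifted by a sinusoidal timestep embedding and then cross-attends to frozen LLM token features. The names `timestep_embedding`, `connector`, `queries`, and `llm_features` are hypothetical.

```python
import math
import random


def timestep_embedding(t, dim):
    """Sinusoidal embedding of a scalar timestep; assumes dim is even."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def connector(llm_features, queries, t):
    """Map variable-length LLM token features to a fixed number of
    condition tokens that depend on the denoising timestep t."""
    dim = len(queries[0])
    emb = timestep_embedding(t, dim)
    # Assumption: the timestep is injected by shifting each query.
    # A trained connector would more likely use a learned modulation.
    shifted = [[q_i + e_i for q_i, e_i in zip(q, emb)] for q in queries]
    tokens = []
    for q in shifted:
        # Scaled dot-product attention of one query over all LLM tokens.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in llm_features]
        weights = softmax(scores)
        tokens.append([sum(w * f[d] for w, f in zip(weights, llm_features))
                       for d in range(dim)])
    return tokens


random.seed(0)
DIM = 8
# Stand-ins for frozen LLM outputs (20 prompt tokens) and 4 query vectors.
llm_features = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(20)]
queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(4)]

cond_early = connector(llm_features, queries, t=900)  # noisy, early step
cond_late = connector(llm_features, queries, t=50)    # nearly clean, late step
```

In this sketch the number of condition tokens is fixed by the number of queries, whatever the prompt length, and the same prompt yields different conditions at different timesteps. Only the connector would need training in such a setup, since the LLM features are treated as fixed inputs.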