8 Mar 2024 | Xiwei Hu*, Rui Wang*, Yixiao Fang*, Bin Fu*, Pei Cheng, and Gang Yu**
The paper introduces ELLA, a method that enhances text-to-image diffusion models by integrating powerful Large Language Models (LLMs) without the need for training the U-Net or LLM. ELLA employs a Timestep-Aware Semantic Connector (TSC) to dynamically extract timestep-dependent conditions from LLMs, improving text alignment during the denoising process. The TSC is designed to adapt semantic features at different stages of denoising, enabling the model to interpret lengthy and complex prompts. The authors also introduce the Dense Prompt Graph Benchmark (DPG-Bench), a comprehensive dataset with 1,065 dense prompts, to evaluate the model's performance on dense prompt following. Extensive experiments demonstrate that ELLA outperforms state-of-the-art methods in dense prompt following, particularly in scenarios involving multiple objects with diverse attributes and relationships. The paper highlights the superior semantic alignment capabilities of ELLA and its potential for enhancing community models and downstream tools.
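
The idea of extracting timestep-dependent conditions from frozen LLM features can be illustrated with a toy sketch. The summary above does not describe the TSC's internals, so everything here is an assumption made for illustration: a small set of query vectors is modulated by a sinusoidal timestep embedding and then cross-attends to the text features. The class name `ToyTimestepAwareConnector`, its parameters, and the random weights are hypothetical and are not taken from the paper.

```python
import math
import random


def timestep_embedding(t, dim):
    """Sinusoidal embedding of a scalar timestep (dim must be even)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class ToyTimestepAwareConnector:
    """Hypothetical stand-in for a timestep-aware connector (not ELLA's actual TSC)."""

    def __init__(self, dim, num_queries, seed=0):
        rng = random.Random(seed)

        def rand(rows, cols):
            return [[rng.gauss(0, 0.5) for _ in range(cols)] for _ in range(rows)]

        self.dim = dim
        self.queries = rand(num_queries, dim)  # would be learned in a real model
        self.W_scale = rand(dim, dim)          # timestep embedding -> per-channel scale
        self.W_shift = rand(dim, dim)          # timestep embedding -> per-channel shift

    def __call__(self, text_feats, t):
        emb = timestep_embedding(t, self.dim)
        scale = matvec(self.W_scale, emb)
        shift = matvec(self.W_shift, emb)
        out = []
        for q in self.queries:
            # Assumption: the timestep modulates each query via a scale and shift.
            q_t = [qi * (1.0 + s) + b for qi, s, b in zip(q, scale, shift)]
            # Scaled dot-product attention over the frozen text features.
            scores = [
                sum(a * b for a, b in zip(q_t, tok)) / math.sqrt(self.dim)
                for tok in text_feats
            ]
            weights = softmax(scores)
            out.append([
                sum(w * tok[j] for w, tok in zip(weights, text_feats))
                for j in range(self.dim)
            ])
        return out


# Stand-in for frozen LLM output: 20 token features of width 8.
rng = random.Random(1)
text_feats = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(20)]

connector = ToyTimestepAwareConnector(dim=8, num_queries=4)
cond_early = connector(text_feats, t=900)  # early, noisy denoising step
cond_late = connector(text_feats, t=50)    # late, nearly clean step
```

In this sketch the same prompt features yield a fixed-size set of condition tokens that differs between `t=900` and `t=50`, while the text features themselves are never modified. That is the property the summary attributes to the TSC, shown here in a deliberately simplified form.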