Controllable Generation with Text-to-Image Diffusion Models: A Survey

7 Mar 2024 | Pu Cao, Feng Zhou, Qing Song, Lu Yang
This survey provides a comprehensive review of controllable generation with text-to-image (T2I) diffusion models, covering both theoretical foundations and practical advances. The authors begin by introducing denoising diffusion probabilistic models (DDPMs) and widely used T2I diffusion models such as GLIDE, Imagen, DALL-E 2, LDM, and Stable Diffusion. They then examine the controlling mechanisms of diffusion models, explaining how novel conditions are introduced into the denoising process to steer conditional generation (a minimal sketch of such a conditioned denoising step is given below).

The survey organizes controllable generation into three sub-tasks: generation with specific conditions, generation with multiple conditions, and universal controllable generation. For each sub-task it discusses tuning-based, model-based, and training-free approaches. Tuning-based methods adapt model parameters or embeddings to a specific condition; model-based methods train encoders that extract personalized conditions and feed them into the diffusion model; and training-free methods exploit external references to steer the generative process without any additional training.

The survey also covers advanced text-conditioned generation, addressing challenges such as textual misalignment and the lack of multilingual support, and it stresses the importance of preserving the generative model's broader applicability and editability, especially when adapting it on small-scale datasets. Overall, the survey aims to give a thorough account of the current state of controllable generation with T2I diffusion models, offering insights into theoretical foundations, practical applications, and future research directions.
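As a rough illustration of the conditioning mechanism discussed above, the sketch below shows how a text condition can enter a single reverse (denoising) step via classifier-free guidance, the scheme commonly used by Stable Diffusion-style models. This is a minimal sketch, not code from the survey: the `denoiser` function is a stand-in for a text-conditioned epsilon-prediction U-Net, the noise schedule and embedding shapes are made up for the toy example, and the update is a simplified deterministic (DDIM-style) step.

```python
import torch

# Placeholder noise predictor; a real model would be a text-conditioned U-Net
# (e.g. the epsilon-prediction network of an LDM / Stable Diffusion).
def denoiser(x_t, t, text_emb):
    return torch.zeros_like(x_t)

def cfg_denoise_step(x_t, t, text_emb, null_emb, alphas_cumprod, guidance_scale=7.5):
    """One reverse-diffusion step with classifier-free guidance (simplified)."""
    # Predict noise with and without the text condition.
    eps_cond = denoiser(x_t, t, text_emb)
    eps_uncond = denoiser(x_t, t, null_emb)
    # Guided estimate: push the prediction toward the conditioned branch.
    eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)

    # Deterministic (DDIM-style) update; the stochastic noise term is omitted.
    alpha_bar_t = alphas_cumprod[t]
    alpha_bar_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
    x0_pred = (x_t - (1 - alpha_bar_t).sqrt() * eps) / alpha_bar_t.sqrt()
    return alpha_bar_prev.sqrt() * x0_pred + (1 - alpha_bar_prev).sqrt() * eps

# Toy usage: a 4-channel latent, as in latent diffusion models (LDM).
T = 1000
alphas_cumprod = torch.linspace(0.9999, 0.01, T)   # made-up schedule
x = torch.randn(1, 4, 64, 64)
text_emb = torch.randn(1, 77, 768)                  # hypothetical text embedding
null_emb = torch.zeros(1, 77, 768)                  # "empty prompt" embedding
for t in reversed(range(T)):
    x = cfg_denoise_step(x, t, text_emb, null_emb, alphas_cumprod)
```

The design point the survey highlights is visible here: a novel condition (the text embedding, or any other signal an encoder produces) only has to influence the noise prediction at each step for the whole generation trajectory to become controllable.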