Ctrl-X: Controlling Structure and Appearance for Text-To-Image Generation Without Guidance

11 Jun 2024 | Kuan Heng Lin, Sicheng Mo, Ben Klingher, Fangzhou Mu, Bolei Zhou
Ctrl-X is a training-free and guidance-free framework for structure and appearance control in text-to-image (T2I) diffusion models. Given a structure image and an appearance image, it generates an image that inherits the structure of the former and the appearance of the latter. The method uses feature injection and spatially-aware normalization in the attention layers to align the generated image with the user-provided images. Ctrl-X supports arbitrary structure and appearance conditions and can be applied to any T2I or text-to-video (T2V) diffusion model as an instant plug-and-play component. It achieves superior image quality and appearance transfer compared to existing methods, with a 40-fold increase in inference speed compared to guidance-based methods. It also supports prompt-driven conditional generation. Extensive experiments demonstrate the effectiveness of Ctrl-X in structure preservation and appearance alignment.
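
The summary above names two operations, feature injection and spatially-aware normalization, without giving their equations. The following is a minimal NumPy sketch of one plausible reading, not the authors' implementation. It assumes that features are flattened to (tokens, channels) arrays, that injection simply overwrites the output features with the structure features, and that the normalization re-scales each output token with attention-weighted mean and standard deviation of the appearance features. All function names (inject_structure, spatially_aware_normalize, and the helpers) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def normalize(feats, eps=1e-6):
    """Per-channel zero-mean, unit-variance normalization over the token axis."""
    mean = feats.mean(axis=0, keepdims=True)
    std = feats.std(axis=0, keepdims=True)
    return (feats - mean) / (std + eps)


def inject_structure(out_feats, struct_feats):
    """Assumed form of feature injection: replace output features
    with the structure branch's features at the same layer."""
    assert out_feats.shape == struct_feats.shape
    return struct_feats.copy()


def spatially_aware_normalize(out_feats, app_feats):
    """Assumed form of spatially-aware normalization.

    out_feats: (N_out, C) features of the image being generated.
    app_feats: (N_app, C) features of the appearance image.
    Each output token receives its own mean and std, computed as an
    attention-weighted statistic of the appearance tokens.
    """
    q = normalize(out_feats)
    k = normalize(app_feats)
    # Attention from every output token to every appearance token.
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]), axis=-1)  # (N_out, N_app)
    mean = attn @ app_feats                                  # (N_out, C)
    second_moment = attn @ (app_feats ** 2)                  # (N_out, C)
    # Clamp at zero to guard against tiny negative values from rounding.
    std = np.sqrt(np.maximum(second_moment - mean ** 2, 0.0))
    return std * q + mean


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    out = rng.normal(size=(16, 8))         # 16 output tokens, 8 channels
    structure = rng.normal(size=(16, 8))   # structure-branch features
    appearance = rng.normal(size=(20, 8))  # 20 appearance tokens

    out = inject_structure(out, structure)
    out = spatially_aware_normalize(out, appearance)
    print(out.shape)
```

Because the statistics are weighted per output token, a region of the generated image takes its color and texture statistics mainly from the appearance tokens it attends to most, rather than from a single global mean and variance.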