Ctrl-X: Controlling Structure and Appearance for Text-To-Image Generation Without Guidance

11 Jun 2024 | Kuan Heng Lin, Sicheng Mo, Ben Klingher, Fangzhou Mu, Bolei Zhou
Ctrl-X is a framework for controlling the structure and appearance of text-to-image (T2I) generation without additional training or guidance. It combines feed-forward structure control, which aligns the generated image with a provided structure image, with semantic-aware appearance transfer, which carries over the appearance of a user-supplied image. Ctrl-X supports a wide range of condition images, including natural images, ControlNet-supported conditions, and in-the-wild conditions, and can be applied to various T2I and text-to-video (T2V) models. The framework is training-free and guidance-free, and achieves a 40-fold increase in inference speed over guidance-based methods. Ctrl-X also supports prompt-driven conditional generation, where the output image aligns with both the given text prompt and the structure of the structure image. Qualitative and quantitative evaluations show that Ctrl-X outperforms existing methods in image quality, appearance alignment, and structure preservation.
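
The summary names semantic-aware appearance transfer without spelling out the mechanism. As a rough illustration only, the sketch below assumes one plausible reading: each position in the output features attends to the appearance features, and the attention-weighted mean and standard deviation of the appearance features are used to restyle that position. The function names (`softmax`, `normalize`, `transfer_appearance`), the list-of-vectors feature layout, and the `sqrt(C)` scaling are assumptions made for this example, not the authors' implementation, which would operate on diffusion model feature maps.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def normalize(feats, eps=1e-6):
    """Standardize each channel to zero mean and unit std across positions."""
    n, c = len(feats), len(feats[0])
    means = [sum(f[k] for f in feats) / n for k in range(c)]
    stds = [math.sqrt(sum((f[k] - means[k]) ** 2 for f in feats) / n)
            for k in range(c)]
    return [[(f[k] - means[k]) / (stds[k] + eps) for k in range(c)]
            for f in feats]


def transfer_appearance(out_feats, app_feats):
    """Hypothetical helper (assumption, not the paper's code).

    out_feats: list of N feature vectors (length C) from the output image.
    app_feats: list of M feature vectors (length C) from the appearance image.
    Returns N vectors whose per-position statistics come from app_feats.
    """
    c = len(out_feats[0])
    # Standardize both sides so matching depends on relative patterns
    # rather than raw feature magnitudes.
    queries = normalize(out_feats)
    keys = normalize(app_feats)
    result = []
    for q in queries:
        # Attention of this output position over all appearance positions.
        weights = softmax([
            sum(a * b for a, b in zip(q, k)) / math.sqrt(c) for k in keys
        ])
        row = []
        for ch in range(c):
            # Attention-weighted first and second moments of the appearance.
            mean = sum(w * f[ch] for w, f in zip(weights, app_feats))
            second = sum(w * f[ch] ** 2 for w, f in zip(weights, app_feats))
            std = math.sqrt(max(second - mean ** 2, 0.0))
            # Re-scale the standardized output feature with those statistics.
            row.append(std * q[ch] + mean)
        result.append(row)
    return result
```

In this sketch no gradients or optimization steps are involved: the transfer is a single forward computation over features, which matches the summary's description of the method as training-free and guidance-free.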