Continuous, Subject-Specific Attribute Control in T2I Models by Identifying Semantic Directions

14 Mar 2025 | Stefan Andreas Baumann, Felix Krause, Michael Neumayr, Nick Stracke, Melvin Sevi, Vincent Tao Hu, Björn Ommer
This paper introduces a method for continuous, subject-specific attribute control in text-to-image (T2I) diffusion models. The key idea is to identify semantic directions in the token-wise CLIP text embedding space that enable fine-grained, subject-specific control over attributes in generated images. The authors propose two ways to find these directions: a simple, optimization-free technique and a learning-based approach that uses the T2I model itself to characterize the semantic concept more precisely. Applying these directions to the prompt's token embeddings allows multiple attributes of individual subjects to be controlled simultaneously and continuously, without any modification to the diffusion model itself. This fills the gap between global and localized control, offering competitive flexibility and precision in text-guided image generation.

Experiments demonstrate the method's effectiveness in several settings, including image generation and real-image editing. The results show fine-grained, subject-specific attribute control that existing methods do not achieve, and the approach is shown to generalize across models and to apply to non-diffusion models as well. The authors conclude that leveraging the token-wise CLIP text embedding space provides a powerful way to steer the image generation process in T2I diffusion models.
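To illustrate the optimization-free variant, the sketch below (hypothetical code, not the authors' implementation) estimates an attribute direction as the average difference between the subject token's CLIP text-encoder embedding in prompts with and without an attribute modifier, then shifts only that token's embedding by a chosen strength at generation time. The model name, helper functions, prompt pairs, and the strength value are illustrative assumptions.

# Sketch of the optimization-free idea: estimate a semantic direction in the
# token-wise CLIP text embedding space and add it (scaled) to one subject
# token's embedding. Model choice and helpers are illustrative assumptions.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

MODEL = "openai/clip-vit-large-patch14"  # assumed; use your T2I model's text encoder
tokenizer = CLIPTokenizer.from_pretrained(MODEL)
text_encoder = CLIPTextModel.from_pretrained(MODEL).eval()

@torch.no_grad()
def token_embeddings(prompt: str) -> torch.Tensor:
    """Per-token CLIP text encoder outputs, shape (77, hidden_dim)."""
    tokens = tokenizer(prompt, padding="max_length", max_length=77,
                       truncation=True, return_tensors="pt")
    return text_encoder(**tokens).last_hidden_state[0]

def subject_token_index(prompt: str, word: str) -> int:
    """Position of the (single-token) subject word in the padded sequence."""
    ids = tokenizer(prompt, padding="max_length", max_length=77,
                    truncation=True).input_ids
    word_id = tokenizer(word, add_special_tokens=False).input_ids[0]
    return ids.index(word_id)

def estimate_direction(prompt_pairs, subject_word: str) -> torch.Tensor:
    """Average embedding difference at the subject token over
    (neutral prompt, attribute prompt) pairs."""
    deltas = []
    for neutral, modified in prompt_pairs:
        e_neutral = token_embeddings(neutral)[subject_token_index(neutral, subject_word)]
        e_modified = token_embeddings(modified)[subject_token_index(modified, subject_word)]
        deltas.append(e_modified - e_neutral)
    return torch.stack(deltas).mean(dim=0)

def edit_prompt_embeddings(prompt: str, subject_word: str,
                           direction: torch.Tensor, strength: float) -> torch.Tensor:
    """Shift only the subject token's embedding; the result can be passed as
    prompt_embeds to a standard diffusion pipeline, leaving the model unchanged."""
    emb = token_embeddings(prompt).clone()
    emb[subject_token_index(prompt, subject_word)] += strength * direction
    return emb.unsqueeze(0)  # (1, 77, hidden_dim)

# Example: an "age" direction for "man", applied only to that subject.
pairs = [("a photo of a man", "a photo of an old man"),
         ("a portrait of a man", "a portrait of an old man")]
age_direction = estimate_direction(pairs, "man")
embeds = edit_prompt_embeddings("a photo of a man and a woman", "man",
                                age_direction, strength=4.0)

Because only the subject token's embedding is shifted, other subjects in the prompt (here, "woman") are left largely unaffected, and varying the strength continuously interpolates the attribute.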