14 Mar 2025 | Stefan Andreas Baumann, Felix Krause, Michael Neumayr, Nick Stracke, Melvin Sevi, Vincent Tao Hu, Björn Ommer
This paper addresses the challenge of providing fine-grained, subject-specific control over attributes in text-to-image (T2I) diffusion models. Whereas existing methods offer either detailed localization or global fine-grained control, this work introduces two methods for identifying token-level directions in CLIP text embeddings that enable continuous, subject-specific modulation of high-level attributes. The first is an optimization-free approach that contrasts text prompts to extract attribute-specific directions; the second is a learning-based approach that identifies more robust directions by backpropagating semantic concepts. These directions augment the prompt's token embeddings, enabling fine-grained control over multiple attributes of individual subjects without modifying the diffusion model itself. Experiments across a range of attributes and subjects demonstrate superior subject-specificity, disentanglement, and fine-grained control compared to existing approaches. The methods also generalize strongly, transferring zero-shot to different models, including non-diffusion models.
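To make the optimization-free variant concrete, here is a minimal sketch of the prompt-contrast idea: encode two prompts that differ only in the attribute, take the difference of the token embeddings at the subject token's position, and average over pairs to get an attribute direction that can then be added (scaled) to a subject token at generation time. This assumes Hugging Face's CLIPTokenizer/CLIPTextModel with openai/clip-vit-large-patch14 (the encoder used by Stable Diffusion); the prompt pairs, helper names, and averaging scheme are illustrative assumptions, not the paper's implementation.

```python
# Sketch of the optimization-free, prompt-contrast approach (illustrative,
# not the authors' code). Direction = mean over pairs of the difference of
# per-token CLIP embeddings at the subject token's position.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

def token_embeddings(prompt: str) -> torch.Tensor:
    """Per-token CLIP text embeddings for a prompt, shape (seq_len, dim)."""
    inputs = tokenizer(prompt, padding="max_length", return_tensors="pt")
    with torch.no_grad():
        return text_encoder(**inputs).last_hidden_state[0]

def subject_token_index(prompt: str, subject: str) -> int:
    """Position of the subject word's token in the tokenized prompt."""
    ids = tokenizer(prompt, padding="max_length").input_ids
    subject_id = tokenizer(subject).input_ids[1]  # strip BOS/EOS
    return ids.index(subject_id)

# Contrast prompt pairs that differ only in the attribute ("age" here);
# the specific prompts are hypothetical examples.
pairs = [
    ("a photo of an old man", "a photo of a young man", "man"),
    ("a portrait of an old woman", "a portrait of a young woman", "woman"),
]
directions = []
for pos_prompt, neg_prompt, subject in pairs:
    i_pos = subject_token_index(pos_prompt, subject)
    i_neg = subject_token_index(neg_prompt, subject)
    directions.append(token_embeddings(pos_prompt)[i_pos]
                      - token_embeddings(neg_prompt)[i_neg])
direction = torch.stack(directions).mean(dim=0)  # attribute direction

# At generation time, shift only the target subject's token embedding before
# passing the embeddings to the (unmodified) diffusion model:
#   emb[subject_idx] += scale * direction
# where a continuous `scale` modulates the attribute's strength.
```

Because only the embedding of the chosen subject's token is shifted, other subjects in the prompt are left untouched, which is what gives the method its subject-specificity.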