Attention Calibration for Disentangled Text-to-Image Personalization


2024 | Yanbing Zhang, Mengping Yang, Qin Zhou, Zhe Wang
This paper proposes DisenDiff, a personalized text-to-image (T2I) model that generates customized images of multiple concepts learned from a single reference image. The key challenge it addresses is disentangling those concepts so that each can be generated without compromising visual fidelity or identity preservation. DisenDiff introduces an attention calibration mechanism that improves the model's understanding of the concepts at the level of cross-attention maps: new learnable modifiers bound with class words capture the attributes of each concept, cross-attention operations separate and strengthen the classes, and attention activations are suppressed to mitigate mutual influence among concepts.

With these components, DisenDiff learns disentangled multiple concepts from a single image and produces novel customized images containing them. It outperforms the current state of the art in both qualitative and quantitative evaluations, achieving the highest image-alignment scores while maintaining strong text-editing effectiveness; that is, the generated images stay faithful to the reference image while accurately conveying the target text. The proposed techniques are also compatible with LoRA and inpainting pipelines, enabling more interactive experiences, and the method extends to image inpainting and to three concepts, demonstrating its flexibility across tasks.
The paper concludes that DisenDiff provides a significant improvement in generating images with multiple concepts from a single reference image, while addressing the challenges of overfitting and maintaining visual consistency.
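To make the calibration idea more concrete, below is a minimal PyTorch sketch of how such objectives could be expressed on cross-attention maps: a binding term that ties each learnable modifier to its class word, and a separation term that suppresses overlap between different concepts' attention. The function name, token indices, and exact loss forms are illustrative assumptions, not the paper's precise formulation.

import torch
import torch.nn.functional as F

def bind_and_separate_losses(attn, modifier_idx, class_idx):
    """Illustrative attention-calibration losses on cross-attention maps.

    attn: (B, HW, T) softmax cross-attention maps from a diffusion U-Net layer,
          where T is the number of text tokens.
    modifier_idx / class_idx: token positions of the learnable modifiers
          (e.g. <new1>, <new2>) and of their bound class words (e.g. "cat", "dog").
          These names and loss forms are assumptions for illustration only.
    """
    # Attention maps of each modifier and of its class word: (B, HW, K)
    mod_maps = attn[:, :, modifier_idx]
    cls_maps = attn[:, :, class_idx]

    # (1) Bind each modifier to its class: encourage the modifier's map to
    #     cover the same region as the class word's map.
    bind_loss = F.mse_loss(mod_maps, cls_maps.detach())

    # (2) Separate concepts: penalize the pixel-wise product of every pair of
    #     class maps so that different concepts' attention does not overlap.
    K = cls_maps.shape[-1]
    sep_loss = attn.new_zeros(())
    n_pairs = 0
    for i in range(K):
        for j in range(i + 1, K):
            sep_loss = sep_loss + (cls_maps[..., i] * cls_maps[..., j]).mean()
            n_pairs += 1
    if n_pairs > 0:
        sep_loss = sep_loss / n_pairs

    return bind_loss, sep_loss

# Toy usage with random maps: batch of 2, a 16x16 latent (256 positions), 8 text tokens.
attn = torch.rand(2, 256, 8).softmax(dim=-1)
bind, sep = bind_and_separate_losses(attn, modifier_idx=[2, 5], class_idx=[3, 6])
total = bind + 0.1 * sep  # the weight is a placeholder, not the paper's value
print(float(bind), float(sep), float(total))

In training, terms of this kind would be added to the standard diffusion denoising loss with small weights; the weight shown above is a placeholder.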