Attention Calibration for Disentangled Text-to-Image Personalization

11 Apr 2024 | Yanbing Zhang, Mengping Yang, Qin Zhou, Zhe Wang
The paper "Attention Calibration for Disentangled Text-to-Image Personalization" addresses the challenge of generating personalized images from a single reference image by capturing multiple, novel concepts. The authors propose a method called *DisenDiff* (Disentangled Diffusion) to improve the concept-level understanding of text-to-image (T2I) models. Key contributions include: 1. **Attention Calibration Mechanism**: The method introduces new learnable modifiers bound with classes to capture attributes of multiple concepts. These modifiers are separated and strengthened using the cross-attention operation, ensuring comprehensive and self-contained concepts. 2. **Suppression Technique**: This technique suppresses the attention activation of different classes to mitigate mutual influence among concepts, enhancing the independence of learned concepts. 3. **Evaluation**: The proposed method outperforms state-of-the-art approaches in both qualitative and quantitative evaluations, demonstrating superior performance in generating customized images with learned concepts. 4. **Compatibility**: *DisenDiff* is compatible with LoRA and inpainting pipelines, enabling more interactive experiences. The paper also includes a detailed experimental setup, comparisons with existing methods, and ablation studies to validate the effectiveness of each component. The results show that *DisenDiff* achieves high visual fidelity and maintains strong text editing effectiveness, making it a promising approach for personalized T2I generation.The paper "Attention Calibration for Disentangled Text-to-Image Personalization" addresses the challenge of generating personalized images from a single reference image by capturing multiple, novel concepts. The authors propose a method called *DisenDiff* (Disentangled Diffusion) to improve the concept-level understanding of text-to-image (T2I) models. Key contributions include: 1. **Attention Calibration Mechanism**: The method introduces new learnable modifiers bound with classes to capture attributes of multiple concepts. These modifiers are separated and strengthened using the cross-attention operation, ensuring comprehensive and self-contained concepts. 2. **Suppression Technique**: This technique suppresses the attention activation of different classes to mitigate mutual influence among concepts, enhancing the independence of learned concepts. 3. **Evaluation**: The proposed method outperforms state-of-the-art approaches in both qualitative and quantitative evaluations, demonstrating superior performance in generating customized images with learned concepts. 4. **Compatibility**: *DisenDiff* is compatible with LoRA and inpainting pipelines, enabling more interactive experiences. The paper also includes a detailed experimental setup, comparisons with existing methods, and ablation studies to validate the effectiveness of each component. The results show that *DisenDiff* achieves high visual fidelity and maintains strong text editing effectiveness, making it a promising approach for personalized T2I generation.