Tuning-Free Image Customization with Image and Text Guidance

19 Mar 2024 | Pengzhi Li*, Qiang Nie*, Ying Chen, Xi Jiang, Kai Wu, Yuhuan Lin, Yong Liu, Jinlong Peng, Chengjie Wang, Feng Zheng†
This paper introduces a tuning-free framework for image customization that uses text and image guidance simultaneously to edit specific regions of an image. The method enables precise editing of target regions within seconds, preserving the semantic features of the reference subject while allowing its detailed attributes to be modified according to a text description.

The core contribution is an attention blending strategy that blends self-attention features in the UNet decoder during the denoising process. Instead of the original self-attention injection used throughout denoising, the framework employs blended self-attention, which retains the generated subject's features while enabling text-driven attribute modification. This yields high-fidelity generation driven by both text and image, overcoming the limitations of prior methods that rely on text or image guidance alone.

Evaluated against existing state-of-the-art methods, the approach demonstrates superior performance in text-driven capability, image reconstruction quality, and attribute editing accuracy, in both human and quantitative evaluations. Applications to creative photography and graphic design show that it generates realistic, harmonious images with consistent lighting and environmental features. Ablation studies further confirm that the attention blending strategy is key to achieving precise edits while preserving the subject's features.
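The paper does not publish pseudocode here, but the blended self-attention idea can be sketched as follows: during denoising, the target branch's queries attend over a key/value bank extended with features from the reference branch, and the blended result is applied only inside the edit-region mask. Everything below (function names, the `beta` blending weight, using keys as values for brevity) is a simplified, hypothetical illustration, not the authors' implementation.

```python
import numpy as np

def self_attention(q, k, v):
    # Standard scaled dot-product attention over the token dimension.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def blended_self_attention(q_tgt, feats_tgt, feats_ref, mask, beta=0.8):
    """Hypothetical sketch of attention blending.

    q_tgt:     (n, d) queries of the target (edited) denoising branch
    feats_tgt: (n, d) target-branch features used as keys/values
    feats_ref: (m, d) reference-branch features carrying the subject
    mask:      (n,)   1 inside the region to customize, 0 elsewhere
    beta:      blending weight between reference-augmented and plain output
    """
    # Extend the key/value bank with reference features (K == V here for brevity).
    k_ext = np.concatenate([feats_tgt, feats_ref], axis=0)
    out_blend = self_attention(q_tgt, k_ext, k_ext)
    out_plain = self_attention(q_tgt, feats_tgt, feats_tgt)
    m = mask[:, None]  # broadcast mask over feature channels
    # Blend only inside the edit region; leave the rest untouched.
    return m * (beta * out_blend + (1 - beta) * out_plain) + (1 - m) * out_plain
```

Outside the mask the output reduces exactly to plain self-attention, which matches the paper's goal of editing only the target region while leaving the surrounding image intact.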
A limitation is that the method struggles to generate the subject from multiple perspectives, since the self-attention blending mechanism operates without any tuning.