DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations


12 Mar 2024 | Tianhao Qi, Shancheng Fang, Yanze Wu, Hongtao Xie, Jiawei Liu, Lang Chen, Qian He, Yongdong Zhang
DEADiff is an efficient stylization diffusion model that addresses the issue of text controllability loss in encoder-based methods. The model decouples the style and semantics of reference images using a dual decoupling representation extraction mechanism (DDRE) and a disentangled conditioning mechanism. DDRE employs Q-Formers to extract style and semantic representations from reference images, which are then injected into mutually exclusive subsets of cross-attention layers for better disentanglement. The disentangled conditioning mechanism allows different parts of the cross-attention layers to handle image style and semantic representation separately, reducing semantic conflicts. Additionally, a non-reconstruction training paradigm is used, where the Q-Formers are trained on paired images rather than the identical target. This approach enables the model to maintain text controllability while achieving high style similarity to the reference image.

DEADiff outperforms existing methods in terms of style similarity, text alignment, and image quality, as demonstrated both quantitatively and qualitatively. The model is efficient and can be applied to various tasks, including stylization of reference semantics, style mixing, and switching between different base T2I models. DEADiff is also compatible with existing frameworks like ControlNet and DreamBooth/LoRA, making it a versatile tool for text-to-image generation.
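To make the disentangled conditioning idea concrete, the sketch below shows (in PyTorch style, not the official DEADiff implementation) how style and semantic tokens produced by separate Q-Formers could be routed to mutually exclusive subsets of cross-attention layers, while text tokens condition every layer. All module names, tensor dimensions, and the particular layer split are illustrative assumptions.

```python
# Minimal sketch of disentangled conditioning: each cross-attention block sees
# the text tokens plus exactly one of the two reference streams (style OR
# semantics), never both. Shapes and the 2/2 layer split are assumptions.

import torch
import torch.nn as nn


class CrossAttention(nn.Module):
    """Single-head cross-attention: image tokens attend to a context sequence."""

    def __init__(self, dim: int, context_dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(context_dim, dim, bias=False)
        self.to_v = nn.Linear(context_dim, dim, bias=False)

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        q, k, v = self.to_q(x), self.to_k(context), self.to_v(context)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v


class DisentangledBlock(nn.Module):
    """One U-Net block whose cross-attention receives either the style tokens
    or the semantic tokens, concatenated with the text tokens."""

    def __init__(self, dim: int, context_dim: int, role: str):
        super().__init__()
        assert role in {"style", "semantic"}
        self.role = role
        self.attn = CrossAttention(dim, context_dim)

    def forward(self, x, text_tokens, style_tokens, sem_tokens):
        # Mutually exclusive injection: this block only ever sees one image stream.
        ref = style_tokens if self.role == "style" else sem_tokens
        context = torch.cat([text_tokens, ref], dim=1)
        return x + self.attn(x, context)


if __name__ == "__main__":
    dim, ctx = 64, 64
    # Hypothetical split: two blocks handle style, two handle semantics.
    blocks = [DisentangledBlock(dim, ctx, "style") for _ in range(2)] + \
             [DisentangledBlock(dim, ctx, "semantic") for _ in range(2)]

    x = torch.randn(1, 256, dim)       # latent image tokens
    text = torch.randn(1, 77, ctx)     # text-encoder tokens
    style = torch.randn(1, 16, ctx)    # style Q-Former output queries
    sem = torch.randn(1, 16, ctx)      # semantic Q-Former output queries

    for blk in blocks:
        x = blk(x, text, style, sem)
    print(x.shape)  # torch.Size([1, 256, 64])
```

Because the two reference streams never meet inside the same cross-attention layer, style cues from the reference image cannot override the semantics requested by the text prompt, which is the conflict the disentangled conditioning mechanism is designed to avoid.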