DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations

12 Mar 2024 | Tianhao Qi, Shancheng Fang, Yanze Wu, Hongtao Xie, Jiawei Liu, Lang Chen, Qian He, Yongdong Zhang
This paper addresses the reduced text controllability of encoder-based stylization methods for text-to-image diffusion models. The proposed DEADiff decouples the style and semantics of reference images in order to restore text controllability. It introduces a dual decoupling representation extraction (DDRE) mechanism that uses Q-Formers to extract separate style and semantic representations from the reference image; these representations are then injected into mutually exclusive subsets of the cross-attention layers for better disentanglement. In addition, a non-reconstructive learning method trains the Q-Formers on paired images that share the same style or the same semantics, rather than on reconstructing the reference image, so that the model attends to both the style reference and the text condition. This design achieves an optimal balance between style similarity and text controllability, and quantitative and qualitative evaluations demonstrate superior performance over existing methods in style similarity, text alignment, and image quality. The project page is available at https://tianhao-qi.github.io/DEADiff/.
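To make the DDRE idea concrete, below is a minimal PyTorch sketch of the core mechanism described above: two Q-Former-like extractors produce style and semantic tokens from reference-image features, and each token set is routed to a disjoint subset of cross-attention layers. All module names, dimensions, and the particular layer split are illustrative assumptions, not the authors' exact implementation.

```python
# Hedged sketch of dual decoupling representation extraction (DDRE).
# Assumptions: CLIP-like patch features as input, a simplified single-layer
# Q-Former, and a hypothetical half/half split of cross-attention layers.
import torch
import torch.nn as nn


class SimpleQFormer(nn.Module):
    """Learned queries that cross-attend to reference-image features."""

    def __init__(self, num_queries: int = 16, dim: int = 768):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        q = self.queries.expand(image_feats.size(0), -1, -1)
        out, _ = self.attn(q, image_feats, image_feats)
        return self.proj(out)  # (B, num_queries, dim) conditioning tokens


class DDREConditioner(nn.Module):
    """Extract style/semantic tokens and route them to disjoint layer subsets."""

    def __init__(self, num_cross_attn_layers: int = 16):
        super().__init__()
        self.style_qformer = SimpleQFormer()
        self.semantic_qformer = SimpleQFormer()
        # Hypothetical split: one half of the layers receives style tokens,
        # the other half receives semantic tokens (mutually exclusive subsets).
        half = num_cross_attn_layers // 2
        self.style_layers = set(range(half))
        self.semantic_layers = set(range(half, num_cross_attn_layers))

    def forward(self, image_feats: torch.Tensor) -> dict[int, torch.Tensor]:
        style_tokens = self.style_qformer(image_feats)
        semantic_tokens = self.semantic_qformer(image_feats)
        routing = {}
        for layer_idx in self.style_layers:
            routing[layer_idx] = style_tokens
        for layer_idx in self.semantic_layers:
            routing[layer_idx] = semantic_tokens
        return routing  # per-layer extra context for the cross-attention blocks


if __name__ == "__main__":
    # Reference-image patch features: (batch, tokens, dim).
    feats = torch.randn(2, 257, 768)
    per_layer_context = DDREConditioner()(feats)
    print({k: v.shape for k, v in per_layer_context.items()})
```

In an actual stylization pipeline, each cross-attention layer would concatenate or attend to its routed tokens alongside the text embeddings; the non-reconstructive training described above would then supervise the style branch with same-style image pairs and the semantic branch with same-content pairs.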