Implicit Style-Content Separation using B-LoRA

21 Mar 2024 | Yarden Frenkel, Yael Vinker, Ariel Shamir, Daniel Cohen-Or
The paper introduces B-LoRA, a method that leverages Low-Rank Adaptation (LoRA) to implicitly separate the style and content components of a single image, enabling a range of image stylization tasks. By analyzing the architecture of Stable Diffusion XL (SDXL) combined with LoRA, the authors find that jointly learning the LoRA weights of two specific transformer blocks (referred to as B-LoRAs) achieves a style-content separation that cannot be obtained by training each B-LoRA independently. This approach yields significant improvements in style manipulation and mitigates the overfitting often associated with model fine-tuning.

Once trained, the two B-LoRAs serve as independent components for various stylization tasks, including image style transfer, text-based image stylization, consistent style generation, and style-content mixing. The method optimizes the LoRA weights of two specific transformer blocks within SDXL: $W_0^4$, which is found to capture the content of the input image, and $W_0^5$, which captures its style, while the block $W_0^2$ is observed to mainly capture color. Training only these two blocks suffices to fully reconstruct the input image while preserving both its content and style.

The paper also includes a user study and quantitative comparisons with alternative approaches, showing that B-LoRA outperforms existing methods in both style alignment and content preservation. The method has limitations, however, such as sub-optimal identity preservation caused by the color separation, and difficulty capturing content in complex scenes. Future work could explore more explicit separation techniques and extend the method to multiple objects or styles.