5 Apr 2024 | Gihyun Kwon¹, Simon Jenni², Dingzeyu Li², Joon-Young Lee², Jong Chul Ye¹, Fabian Caba Heilbron²
Concept Weaver is a method for generating images that incorporate multiple custom concepts. It first creates a template image aligned with the semantics of the input prompt and then personalizes the template with a concept fusion strategy, which injects the appearance of the target concepts into the template image while retaining its structural details. The results indicate that the method generates multiple custom concepts with higher identity fidelity than alternative approaches. Furthermore, it seamlessly handles more than two concepts and closely follows the semantic meaning of the input prompt without blending appearances across different subjects.
The method is based on a cascading generation process. It starts by personalizing text-to-image models for each concept. Then, a non-personalized 'template image' is selected using the given prompt. The method extracts latent representations from this template to aid in later editing. The specific regions of the template image that correspond to the target subjects are identified and isolated. Finally, the latent representations, targeted spatial regions, and personalized models are combined to reconstruct the template image, infusing it with the specified concepts.
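To make the final fusion step concrete, here is a minimal sketch, in PyTorch, of how masked region routing could blend denoising predictions: each subject region is driven by its personalized model while the template model fills in the rest, preserving structure. The function name, tensor shapes, and toy masks are illustrative assumptions rather than the authors' implementation.

```python
import torch

def fuse_noise_predictions(eps_template, eps_concepts, masks):
    """Blend latent-space noise predictions (B, C, H, W): each concept's
    personalized model drives its masked region (B, 1, H, W), while the
    non-personalized template model fills the remaining background."""
    coverage = torch.clamp(torch.stack(masks).sum(dim=0), 0.0, 1.0)
    fused = eps_template * (1.0 - coverage)   # keep template structure
    for eps_c, mask in zip(eps_concepts, masks):
        fused = fused + eps_c * mask          # inject concept appearance
    return fused

# Toy usage with random tensors standing in for real model outputs.
b, c, h, w = 1, 4, 64, 64
eps_template = torch.randn(b, c, h, w)
eps_cat, eps_dog = torch.randn(b, c, h, w), torch.randn(b, c, h, w)
mask_cat = torch.zeros(b, 1, h, w); mask_cat[..., :32] = 1.0   # left half
mask_dog = torch.zeros(b, 1, h, w); mask_dog[..., 32:] = 1.0   # right half
fused = fuse_noise_predictions(eps_template, [eps_cat, eps_dog],
                               [mask_cat, mask_dog])
print(fused.shape)  # torch.Size([1, 4, 64, 64])
```

Because the masks are disjoint, each spatial location is influenced by at most one personalized model, which is what prevents appearances from blending across subjects.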
The method is evaluated using CLIP scores, which measure the alignment between generated images and text prompts. The results show that the method outperforms baseline approaches on both text-similarity and image-similarity scores, indicating stronger semantic alignment with the prompt as well as better preservation of each concept's appearance. The method is also effective at generating images with multiple concepts, including complex interactions between them, can be applied to customize real images, and extends easily to efficient LoRA fine-tuning.
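As a reference point, here is a minimal sketch of how such CLIP text-similarity and image-similarity scores are typically computed, using a standard Hugging Face transformers checkpoint; the paper's exact CLIP variant and scoring protocol are assumptions here.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def text_similarity(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between the image and prompt embeddings."""
    inputs = processor(text=[prompt], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img * txt).sum(dim=-1).item()

def image_similarity(generated: Image.Image, reference: Image.Image) -> float:
    """Cosine similarity between a generated image and a reference
    photo of the target concept."""
    inputs = processor(images=[generated, reference], return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return torch.dot(feats[0], feats[1]).item()
```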