MC²: Multi-concept Guidance for Customized Multi-concept Generation


12 Apr 2024 | Jiaxiu Jiang, Yabo Zhang, Kailai Feng, Xiaohu Wu, and Wangmeng Zuo
This paper introduces MC², a method for customized multi-concept generation that enables the seamless integration of heterogeneous single-concept customized models. MC² improves flexibility and fidelity by decoupling the requirements on model architecture through inference-time optimization, allowing different single-concept customized models to be combined. It adaptively refines the attention weights between visual and textual tokens, directing image regions to focus on their associated words while diminishing the impact of irrelevant ones. Extensive experiments show that MC² surpasses previous methods that require additional training in terms of consistency with both the input prompts and the reference images. MC² can also be extended to enhance the compositional capability of text-to-image generation, yielding appealing results. The code is publicly available at https://github.com/JIANGJiaXiu/MC-2.

Keywords: Text-to-image generation · Customized multi-concept generation · Compositional generation

Customized multi-concept generation aims to synthesize instantiations of user-specified concepts. Existing methods face limitations in flexibility and fidelity when extended to multiple customized concepts. MC² addresses these issues by integrating various single-concept customized models without additional training. Its multi-concept guidance (MCG) adaptively refines the attention weights between visual and textual tokens during sampling, so that each image region attends to its associated words while the influence of irrelevant words is suppressed. The same mechanism can also be extended to improve the compositional generation ability of existing text-to-image diffusion models.
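To make the idea of inference-time attention guidance concrete, the sketch below shows a generic attention-refinement step in PyTorch. It is an illustrative approximation, not the authors' implementation: the exact MC² loss, the way cross-attention maps are extracted from the denoiser, and the helper attn_from_latent are all assumptions; only the overall pattern (compute a loss over cross-attention weights, then nudge the noisy latent with its gradient at each denoising step) reflects the kind of guidance described above.

```python
import torch

def attention_guidance_loss(attn, token_groups):
    """attn: (num_pixels, num_tokens) cross-attention weights at one step.
    token_groups: one list of token indices per customized concept.
    Illustrative loss only (not the exact MC2 objective): within each concept's
    dominant region, strengthen attention to that concept's tokens and suppress
    attention leaking to the other concepts."""
    per_concept = torch.stack(
        [attn[:, idx].sum(dim=-1) for idx in token_groups], dim=-1
    )  # (num_pixels, num_concepts)
    dominant = per_concept.argmax(dim=-1)  # concept each pixel attends to most
    loss = attn.new_zeros(())
    for g in range(per_concept.shape[-1]):
        region = per_concept[dominant == g]   # pixels assigned to concept g
        if region.numel() == 0:
            continue
        own = region[:, g]                    # attention to its own tokens
        leaked = region.sum(dim=-1) - own     # attention to the other concepts
        loss = loss + (1.0 - own.mean()) + leaked.mean()
    return loss

def guided_latent_update(latent, attn_from_latent, token_groups, step_size=0.1):
    """One guidance update at a denoising step: move the noisy latent in the
    direction that reduces the attention loss. `attn_from_latent` is a
    hypothetical callable that runs the denoiser on `latent` and returns an
    aggregated (num_pixels, num_tokens) cross-attention map."""
    latent = latent.detach().requires_grad_(True)
    loss = attention_guidance_loss(attn_from_latent(latent), token_groups)
    (grad,) = torch.autograd.grad(loss, latent)
    return (latent - step_size * grad).detach()
```

In an actual sampler, this update would be interleaved with the ordinary denoising step, and token_groups would map each customized concept to the token positions of its placeholder word in the prompt.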
MC² is evaluated on the CustomConcept101 dataset and on benchmark datasets for compositional generation. Quantitative evaluations show that it outperforms existing methods in both subject fidelity and prompt fidelity, and a user study confirms that it surpasses the baselines in text alignment and image alignment. Ablation studies show that the proposed loss terms improve fidelity to the reference images. The method is implemented on the Stable Diffusion v1-5 model, with LoRA models serving as the single-concept customized components, and it generates images containing multiple customized concepts without confusing the attributes belonging to each object.
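For context, the snippet below shows one common way to load Stable Diffusion v1-5 together with several single-concept LoRA models using the diffusers library. This naive multi-adapter combination is what MC² aims to improve upon: MC² keeps the single-concept models separate and composes them through guidance at inference time rather than by merging adapter weights. The LoRA paths, adapter names, and placeholder prompt tokens are hypothetical, and the SD v1-5 repository id may need to be swapped for an available mirror.

```python
import torch
from diffusers import StableDiffusionPipeline

# Base model used in the paper's implementation (Stable Diffusion v1-5).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Hypothetical single-concept LoRA checkpoints, one per customized concept.
pipe.load_lora_weights("path/to/lora_concept_a", adapter_name="concept_a")
pipe.load_lora_weights("path/to/lora_concept_b", adapter_name="concept_b")

# Naive baseline: activate both adapters with fixed weights.  MC2 instead
# steers generation at inference time via multi-concept guidance on the
# cross-attention maps, without merging the adapters.
pipe.set_adapters(["concept_a", "concept_b"], adapter_weights=[0.8, 0.8])

image = pipe(
    "a photo of a <concept_a> next to a <concept_b> in a garden",
    num_inference_steps=30,
).images[0]
image.save("multi_concept.png")
```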