This paper addresses the challenge of inter-concept visual confusion in text-guided diffusion models (TGDMs) when generating customized concepts, especially when only a few user-provided visual examples are available. The authors propose CLIF (Contrastive Image-Language Fine-tuning), a simple yet effective method that enhances the contrast between the textual embeddings of customized concepts at the text-encoder stage. By contrastively fine-tuning the text encoder on over-segmented visual data, CLIF reduces confusion in the concept embeddings, leading to more accurate, non-confusing multi-concept generation.
The key idea is to first fine-tune the text encoder to obtain contrastive concept embeddings, and then use these embeddings to train the text-to-image decoder. This two-stage approach keeps the textual embeddings of customized concepts well separated, reducing the likelihood of confusion during image generation. The method builds augmented training data through global, regional, and mixed augmentation, which improves identity preservation, attribute binding, and the presence of every requested concept in the generated image.
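To make the contrastive fine-tuning step concrete, below is a minimal, hypothetical sketch of an InfoNCE-style objective between concept prompts and their (over-)segmented image crops. The names (`clif_contrastive_loss`, `text_encoder`, `image_encoder`, `prompts`, `crops`) are illustrative placeholders under stated assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: contrastive fine-tuning of a text encoder against
# segmented concept crops. Simplified; not the authors' released code.
import torch
import torch.nn.functional as F

def clif_contrastive_loss(text_encoder, image_encoder, prompts, crops, temperature=0.07):
    """InfoNCE-style loss: each concept prompt is pulled toward its own crop
    and pushed away from crops of the other concepts in the batch."""
    text_emb = F.normalize(text_encoder(prompts), dim=-1)   # (B, D) concept-token embeddings
    img_emb = F.normalize(image_encoder(crops), dim=-1)     # (B, D) crop embeddings
    logits = text_emb @ img_emb.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: the matched prompt/crop pair on the diagonal is
    # the positive; every other pair in the batch serves as a negative.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```

In this sketch it is the in-batch negatives (crops of other customized concepts) that push the concept embeddings apart, which is the separation the decoder then inherits in the second training stage.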
The authors demonstrate the effectiveness of CLIF through extensive experiments, comparing it with state-of-the-art methods such as Textual Inversion, Custom Diffusion, DreamBooth, and Mix-of-Show. The results show that CLIF significantly reduces confusion in multi-concept generation, especially in cases involving complex interactions and spatial clutter. CLIF achieves superior image and text alignment and generates high-quality images containing multiple customized concepts without requiring additional spatial constraints.
The paper also discusses the limitations of CLIF, including the need for a larger dataset when generating more than two customized concepts and the risk of one concept dominating the others in multi-concept generation. Overall, CLIF provides a robust solution to concept confusion in TGDMs, making it a powerful tool for customized concept generation.