This paper addresses the challenge of inter-concept visual confusion in text-guided diffusion models (TGDMs) when generating customized concepts, especially when only a few user-provided visual examples are available. The authors propose CLIF (Contrastive Image-Language Fine-tuning), a simple yet effective method that enhances the contrast between the textual embeddings of customized concepts at the text-encoder stage. By contrastively fine-tuning the text encoder on over-segmented visual data, CLIF reduces confusion in the concept embeddings, leading to more accurate, non-confusing multi-concept generation.
The key idea is to first fine-tune the text encoder to obtain contrastive concept embeddings, and then use these embeddings to train the text-to-image decoder. This two-stage approach keeps the textual embeddings of customized concepts well separated, reducing the likelihood of confusion during image generation. The method builds augmented training data through global, regional, and mixed augmentation, which improves identity preservation, attribute binding, and the presence of every requested concept in the generated image.
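To make the contrastive fine-tuning step concrete, below is a minimal, hypothetical sketch of an InfoNCE-style objective between concept prompts and their (over-)segmented image crops. The names (`clif_contrastive_loss`, `text_encoder`, `image_encoder`, `prompts`, `crops`) are illustrative placeholders under stated assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: contrastive fine-tuning of a text encoder against
# segmented concept crops. Simplified; not the authors' released code.
import torch
import torch.nn.functional as F

def clif_contrastive_loss(text_encoder, image_encoder, prompts, crops, temperature=0.07):
    """InfoNCE-style loss: each concept prompt is pulled toward its own crop
    and pushed away from crops of the other concepts in the batch."""
    text_emb = F.normalize(text_encoder(prompts), dim=-1)   # (B, D) concept-token embeddings
    img_emb = F.normalize(image_encoder(crops), dim=-1)     # (B, D) crop embeddings
    logits = text_emb @ img_emb.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: the matched prompt/crop pair on the diagonal is
    # the positive; every other pair in the batch serves as a negative.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```

In this sketch it is the in-batch negatives (crops of other customized concepts) that push the concept embeddings apart, which is the separation the decoder then inherits in the second training stage.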
The authors demonstrate the effectiveness of CLIF through extensive experiments, comparing it with state-of-the-art methods such as Textual Inversion, Custom Diffusion, DreamBooth, and Mix-of-Show. The results show that CLIF significantly reduces confusion in multi-concept generation, especially in cases involving complex interactions and spatial clutter. CLIF achieves superior image and text alignment and generates high-quality images containing multiple customized concepts without requiring additional spatial constraints.
The paper also discusses the limitations of CLIF, including the need for a larger dataset when generating more than two customized concepts and the risk of one concept dominating the others in multi-concept generation. Overall, CLIF provides a robust solution to concept confusion in TGDMs, making it a powerful tool for customized concept generation.