CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching


3 Jun 2024 | Dongzhi Jiang, Guanglu Song, Xiaoshi Wu, Renrui Zhang, Dazhong Shen, Zhuofan Zong, Yu Liu, Hongsheng Li
**Institutions:** CUHK MMLab, SenseTime Research, Shanghai AI Laboratory

**Abstract:** Diffusion models have achieved great success in text-to-image generation, but they still suffer from misalignment between text prompts and generated images. This paper identifies two main causes of misalignment: concept ignorance and concept mismatching. To tackle these issues, the authors propose CoMat, an end-to-end fine-tuning strategy for diffusion models that incorporates an image-to-text concept matching mechanism. The method introduces a novel concept activation module that guides the diffusion model to revisit ignored concepts, and an attribute concentration module that ensures text conditions are mapped to the correct image areas. Extensive experiments on three text-to-image alignment benchmarks demonstrate that CoMat significantly improves text-image alignment, outperforming baseline models such as SDXL.

**Key Contributions:**
- CoMat: an end-to-end fine-tuning strategy that enhances the text-image alignment of diffusion models.
- Concept activation module: guides the diffusion model to revisit ignored concepts using an image-to-text model.
- Attribute concentration module: ensures text conditions are mapped to the correct image areas.
- Experimental results: CoMat significantly improves alignment across scenarios including object existence, attribute binding, and complex prompts.

**Related Work:**
- Attention-based methods: modify attention maps in the UNet.
- Planning-based methods: obtain image layouts from user input or LLMs.
- Feedback from image understanding models: use VQA models to refine generated images.

**Preliminaries:**
- Implementation on the SDXL and SD1.5 diffusion models.
- Training setup and hyperparameters.

**Method:**
- Concept activation: supervises the generated image with an image-to-text (captioning) model, so that concepts the image ignores are penalized (a minimal sketch of this loss follows the list).
- Fidelity preservation: applies an adversarial loss to prevent the fine-tuned diffusion model from overfitting.
- Mixed latent strategy: injects real-world image latents to guide the learning process.
- Attribute concentration: ensures entity tokens are mapped to their correct areas in the image.
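To make the concept activation idea concrete, here is a minimal PyTorch-style sketch of an image-to-text matching loss as summarized above: the generated image is scored by a frozen captioning model, and the negative log-likelihood of the original prompt is backpropagated into the diffusion model. The names `captioner` and `generate_image`, and the teacher-forced logits interface, are illustrative assumptions, not the paper's actual API.

```python
# A minimal, hypothetical sketch of CoMat-style concept activation.
# The captioning model is frozen; gradients flow through the generated
# image back into the diffusion model. `captioner` is assumed to return
# per-token logits of shape (B, L, vocab) when teacher-forced on the
# prompt tokens -- an illustrative interface, not the paper's own.
import torch
import torch.nn.functional as F


def concept_activation_loss(image: torch.Tensor,
                            prompt_ids: torch.Tensor,
                            captioner: torch.nn.Module) -> torch.Tensor:
    """Negative log-likelihood of the prompt under a frozen captioner.

    image:      generated image, (B, 3, H, W); must stay differentiable.
    prompt_ids: tokenized original prompt, (B, L).
    """
    logits = captioner(image, prompt_ids)        # (B, L, vocab)
    log_probs = F.log_softmax(logits, dim=-1)
    # Log-likelihood of each ground-truth prompt token; concepts the
    # image ignores receive low likelihood and dominate the loss.
    token_ll = log_probs.gather(-1, prompt_ids.unsqueeze(-1)).squeeze(-1)
    return -token_ll.mean()


def training_step(generate_image, captioner, optimizer,
                  prompt_ids, prompt_emb):
    """One fine-tuning step. `generate_image` stands in for a
    differentiable prompt-to-image pass (diffusion model + VAE decode)."""
    for p in captioner.parameters():             # keep the captioner frozen
        p.requires_grad_(False)

    image = generate_image(prompt_emb)           # (B, 3, H, W), requires grad
    loss = concept_activation_loss(image, prompt_ids, captioner)

    optimizer.zero_grad()
    loss.backward()                              # gradients reach the diffusion model
    optimizer.step()
    return loss.detach()
```

In the paper's full recipe, this matching objective is paired with the adversarial fidelity-preservation loss and the mixed latent strategy listed above; the sketch isolates only the image-to-text term.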
**Experiments:**
- Baseline models: SD1.5 and SDXL.
- Benchmarks: T2I-CompBench, TIFA, DPG-Bench.
- Quantitative and qualitative results show significant improvements in both alignment and photorealism.

**Conclusion:** CoMat effectively addresses the misalignment problem in text-to-image generation by leveraging image-to-text concept matching. The method enhances the diffusion model's ability to follow text prompts, demonstrating superior performance across multiple alignment benchmarks.