CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching


3 Jun 2024 | Dongzhi Jiang, Guanglu Song, Xiaoshi Wu, Renrui Zhang, Dazhong Shen, Zhuofan Zong, Yu Liu, Hongsheng Li
**Institutions:** CUHK MMLab, SenseTime Research, Shanghai AI Laboratory

**Abstract:** Diffusion models have achieved great success in text-to-image generation, but they still suffer from misalignment between text prompts and generated images. This paper identifies two main causes of misalignment: concept ignorance and concept mismatching. To tackle these issues, the authors propose CoMat, an end-to-end fine-tuning strategy for diffusion models that incorporates an image-to-text concept matching mechanism. The method introduces a novel concept activation module that guides the diffusion model to revisit ignored concepts, and an attribute concentration module that ensures text conditions are mapped to the correct image areas. Extensive experiments on three text-to-image alignment benchmarks demonstrate that CoMat significantly improves text-image alignment, outperforming baseline models such as SDXL.

**Key Contributions:**
- CoMat: an end-to-end fine-tuning strategy that enhances the text-image alignment of diffusion models.
- Concept activation module: guides the diffusion model to revisit ignored concepts using an image-to-text model.
- Attribute concentration module: ensures text conditions are mapped to the correct image areas.
- Experimental results: CoMat significantly improves alignment across scenarios including object existence, attribute binding, and complex prompts.

**Related Work:**
- Attention-based methods: modify attention maps in the UNet.
- Planning-based methods: obtain image layouts from user input or LLMs.
- Feedback from image understanding models: use VQA models to refine generated images.

**Preliminaries:**
- Implementation on the SDXL and SD1.5 diffusion models.
- Training setup and hyperparameters.

**Method:**
- Concept activation: supervises the generated image with an image-to-text (captioning) model, so that concepts the image ignores are penalized (a minimal sketch of this loss follows the list).
- Fidelity preservation: applies an adversarial loss to prevent the fine-tuned diffusion model from overfitting.
- Mixed latent strategy: injects real-world image latents to guide the learning process.
- Attribute concentration: ensures entity tokens are mapped to their correct areas in the image.
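To make the concept activation idea concrete, here is a minimal PyTorch-style sketch of an image-to-text matching loss as summarized above: the generated image is scored by a frozen captioning model, and the negative log-likelihood of the original prompt is backpropagated into the diffusion model. The names `captioner` and `generate_image`, and the teacher-forced logits interface, are illustrative assumptions, not the paper's actual API.

```python
# A minimal, hypothetical sketch of CoMat-style concept activation.
# The captioning model is frozen; gradients flow through the generated
# image back into the diffusion model. `captioner` is assumed to return
# per-token logits of shape (B, L, vocab) when teacher-forced on the
# prompt tokens -- an illustrative interface, not the paper's own.
import torch
import torch.nn.functional as F


def concept_activation_loss(image: torch.Tensor,
                            prompt_ids: torch.Tensor,
                            captioner: torch.nn.Module) -> torch.Tensor:
    """Negative log-likelihood of the prompt under a frozen captioner.

    image:      generated image, (B, 3, H, W); must stay differentiable.
    prompt_ids: tokenized original prompt, (B, L).
    """
    logits = captioner(image, prompt_ids)        # (B, L, vocab)
    log_probs = F.log_softmax(logits, dim=-1)
    # Log-likelihood of each ground-truth prompt token; concepts the
    # image ignores receive low likelihood and dominate the loss.
    token_ll = log_probs.gather(-1, prompt_ids.unsqueeze(-1)).squeeze(-1)
    return -token_ll.mean()


def training_step(generate_image, captioner, optimizer,
                  prompt_ids, prompt_emb):
    """One fine-tuning step. `generate_image` stands in for a
    differentiable prompt-to-image pass (diffusion model + VAE decode)."""
    for p in captioner.parameters():             # keep the captioner frozen
        p.requires_grad_(False)

    image = generate_image(prompt_emb)           # (B, 3, H, W), requires grad
    loss = concept_activation_loss(image, prompt_ids, captioner)

    optimizer.zero_grad()
    loss.backward()                              # gradients reach the diffusion model
    optimizer.step()
    return loss.detach()
```

In the paper's full recipe, this matching objective is paired with the adversarial fidelity-preservation loss and the mixed latent strategy listed above; the sketch isolates only the image-to-text term.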
**Experiments:**
- Baseline models: SD1.5 and SDXL.
- Benchmarks: T2I-CompBench, TIFA, DPG-Bench.
- Quantitative and qualitative results show significant improvements in both alignment and photorealism.

**Conclusion:** CoMat effectively addresses the misalignment problem in text-to-image generation by leveraging image-to-text concept matching. The method enhances the diffusion model's ability to follow text prompts, demonstrating superior performance across multiple alignment benchmarks.