10 Jun 2024 | Yi Gu*, Zhendong Wang*, Yueqin Yin, Yujia Xie, and Mingyuan Zhou*
Diffusion-RPO is a method for aligning text-to-image (T2I) diffusion models with human preferences. It adapts relative preference optimization (RPO) to the diffusion setting: the RPO loss is derived for diffusion models and simplified into a step-wise denoising alignment loss applied at each sampling timestep. Unlike methods that contrast only responses to the same prompt, Diffusion-RPO learns from both identical and semantically related prompt-image pairs, applying contrastive re-weighting to similar pairs across modalities. A CLIP encoder projects prompts and images into a shared embedding space, so the similarity between multi-modal pairs can be measured and used as a re-weighting factor.
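The two technical pieces summarized above, CLIP-based re-weighting of prompt-image pairs and the step-wise denoising preference loss, can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the function names, the default beta, and the exact form of the contrastive weights are assumptions, and the loss follows the familiar Diffusion-DPO-style comparison of denoising errors against a frozen reference model, here extended to weighted contrasts across non-identical pairs.

```python
# Minimal sketch (not the authors' code) of CLIP-based pair re-weighting and a
# step-wise, cross-pair denoising preference loss. Names, the beta value, and
# the exact weighting scheme are illustrative assumptions.
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


@torch.no_grad()
def pair_similarity(prompts, images):
    """Cosine similarities between prompts and images in CLIP's shared space."""
    inputs = clip_proc(text=prompts, images=images, return_tensors="pt", padding=True)
    txt = F.normalize(
        clip.get_text_features(input_ids=inputs["input_ids"],
                               attention_mask=inputs["attention_mask"]), dim=-1)
    img = F.normalize(clip.get_image_features(pixel_values=inputs["pixel_values"]), dim=-1)
    return txt @ img.T  # [B, B] similarity matrix across the batch


def rpo_step_loss(pred_w, pred_l, ref_w, ref_l, noise, weights, beta=2000.0):
    """
    Step-wise denoising preference loss for one diffusion timestep.

    pred_*/ref_*: noise predictions of the trainable and frozen reference UNet
                  for the preferred (w) and dispreferred (l) noisy latents.
    noise:        the noise added at this timestep (shared here for simplicity).
    weights:      [B, B] contrastive weights (assumed non-negative, e.g. clamped
                  CLIP similarities), so that semantically related but
                  non-identical prompt-image pairs also contribute.
    """
    # Per-sample denoising error gap relative to the frozen reference model.
    err_w = (pred_w - noise).pow(2).mean(dim=(1, 2, 3)) - (ref_w - noise).pow(2).mean(dim=(1, 2, 3))
    err_l = (pred_l - noise).pow(2).mean(dim=(1, 2, 3)) - (ref_l - noise).pow(2).mean(dim=(1, 2, 3))

    # Contrast every preferred sample i against every dispreferred sample j,
    # weighted by how related pair (i, j) is in CLIP space.
    margin = -beta * (err_w.unsqueeze(1) - err_l.unsqueeze(0))  # [B, B]
    return -(weights * F.logsigmoid(margin)).sum() / weights.sum()
```

In this reading, the diagonal of the weight matrix recovers the usual same-prompt comparison, while the off-diagonal entries realize the relative, cross-prompt preferences that the method credits for its gains; the real re-weighting may differ in form.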
To address shortcomings of current human preference evaluations, such as high cost, low reproducibility, and limited interpretability, the paper also introduces Style Alignment, a new evaluation task in which a target style (e.g., Van Gogh or sketch) defines the preferred samples within the dataset. In experiments on Stable Diffusion 1.5 and XL-1.0, Diffusion-RPO outperforms supervised fine-tuning, Diffusion-DPO, and other preference learning baselines in automated evaluations of both human preference alignment and style alignment. The key contributions are adapting the RPO framework to diffusion-based T2I models, introducing a simplified step-wise denoising alignment loss with multi-modal re-weighting factors, and proposing Style Alignment as a more effective and reliable evaluation task for image preference learning. The results demonstrate that learning preferences across non-identical prompts significantly enhances the alignment of generated images with human preferences.
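For the Style Alignment task itself, a preference dataset can be assembled by marking images rendered in the target style as the preferred samples. The record layout below is purely hypothetical, since the summary only states that style-matching images are identified as preferred within the dataset.

```python
# Hypothetical layout of a style-alignment preference pair: the image in the
# target style is the preferred sample, a plain rendering the dispreferred one.
# Field names and paths are illustrative assumptions, not the paper's schema.
from dataclasses import dataclass

@dataclass
class StylePreferencePair:
    prompt: str
    preferred: str     # path to the target-style image (e.g. Van Gogh)
    dispreferred: str  # path to the non-style image of the same prompt

pairs = [
    StylePreferencePair(
        prompt="a lighthouse at dusk",
        preferred="images/vangogh/lighthouse.png",
        dispreferred="images/plain/lighthouse.png",
    ),
]
```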