20 Jan 2024 | Yinan Zhang, Eric Tzeng, Yilun Du, Dmitry Kislyuk
This paper presents a scalable reinforcement learning (RL) framework for improving text-to-image diffusion models by aligning them with human preferences, fairness, and compositional diversity. The authors propose a large-scale RL training algorithm that can be applied across millions of prompts and multiple reward functions. The method treats the multi-step denoising process as a Markov decision process (MDP) and uses a policy gradient approach to optimize diffusion models for various objectives, including human aesthetic preference, fairness, and object composition. The framework also incorporates distribution-based reward functions to enhance output diversity and multi-task joint training to optimize for multiple objectives simultaneously.
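To make the MDP framing concrete, below is a minimal sketch (not the authors' code) of a policy-gradient update over denoising trajectories: each reverse-diffusion step is treated as a Gaussian "action", and a trajectory-level reward (e.g. a human-preference score) weights the summed log-probabilities of the steps, REINFORCE-style. The tiny MLP stands in for a diffusion U-Net, and all names and shapes here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Predicts the mean of p_theta(x_{t-1} | x_t, t); a stand-in for a U-Net."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x_t, t], dim=-1))

def policy_gradient_loss(model, trajectories, rewards, sigma: float = 0.1):
    """REINFORCE loss over full denoising trajectories.

    trajectories: list over timesteps of (x_t, x_{t-1}) tensor pairs, each [B, dim]
    rewards:      [B] trajectory-level rewards (e.g. preference-model scores)
    """
    # Normalizing rewards into advantages reduces gradient variance.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    logp_sum = 0.0
    for step, (x_t, x_prev) in enumerate(trajectories):
        t = torch.full((x_t.shape[0], 1), float(len(trajectories) - step))
        mean = model(x_t, t)
        dist = torch.distributions.Normal(mean, sigma)
        logp_sum = logp_sum + dist.log_prob(x_prev).sum(dim=-1)  # [B]
    return -(advantages.detach() * logp_sum).mean()

# Usage: sample trajectories and rewards from the current policy, then update.
model = ToyDenoiser(dim=8)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
traj = [(torch.randn(4, 8), torch.randn(4, 8)) for _ in range(5)]  # placeholder rollouts
rewards = torch.randn(4)                                           # placeholder reward scores
loss = policy_gradient_loss(model, traj, rewards)
loss.backward()
opt.step()
```

Because the reward is only available after the final denoising step, it is shared across all steps of a trajectory; in practice a learned or running-mean baseline plays the role of the simple advantage normalization used above.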
The authors evaluate their approach on three reward functions: human preference, fairness/diversity, and compositionality. For human preference, they use ImageReward, an open-source reward model trained on human preference data, to align diffusion models with what people prefer. For fairness and diversity, they develop a distribution-level reward based on statistical parity over batches of generated images. For compositionality, they use an auxiliary object detector to check whether the objects specified in the prompt are correctly composed in the generated images.
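The following is a minimal sketch (an assumption about the mechanics, not the paper's exact formulation) of a distribution-level fairness reward: generate a batch of images for the same prompt, classify a sensitive attribute (e.g. a skin-tone bucket) with an auxiliary classifier, and reward the batch by how close the empirical attribute distribution is to uniform, here via negative total-variation distance. Every sample in the batch receives the shared batch-level reward.

```python
import torch

def statistical_parity_reward(attribute_ids: torch.Tensor, num_classes: int) -> torch.Tensor:
    """attribute_ids: [B] integer attribute class predicted for each generated image."""
    counts = torch.bincount(attribute_ids, minlength=num_classes).float()
    empirical = counts / counts.sum()                       # observed attribute distribution
    uniform = torch.full((num_classes,), 1.0 / num_classes)
    tv_distance = 0.5 * (empirical - uniform).abs().sum()   # 0 means perfectly balanced
    batch_reward = -tv_distance                             # higher reward = fairer batch
    return batch_reward.expand(attribute_ids.shape[0])      # same reward for every sample

# Usage: a batch of 8 images whose predicted skin-tone bucket skews toward class 0.
preds = torch.tensor([0, 0, 0, 0, 0, 1, 2, 1])
print(statistical_parity_reward(preds, num_classes=4))  # negative, since far from uniform
```

Because this reward is defined over the batch distribution rather than a single image, it cannot be expressed as a per-sample score, which is why the paper treats it separately from pointwise rewards such as ImageReward.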
The results show that the approach significantly outperforms existing methods in aligning diffusion models with human preferences: generated images are preferred by human evaluators 80.3% of the time over those from the base Stable Diffusion model. The method also improves the composition and diversity of generated samples, and the authors demonstrate that it reduces skin-tone bias and improves the accuracy of object composition in generated images.
The paper compares the method with several baseline approaches, including ReFL, RAFT, DRaFT, and a reward-weighted baseline, and shows that it achieves better performance across multiple metrics. The method is also more robust to reward hacking, where a model over-optimizes a single reward function at the expense of overall image quality. The authors conclude that their approach provides a scalable and effective way to improve diffusion models across a wide range of tasks: generating images that align with human preferences, are fair and diverse, and accurately represent object compositions.