PRDP: Proximal Reward Difference Prediction for Large-Scale Reward Finetuning of Diffusion Models

27 Mar 2024 | Fei Deng, Qifei Wang, Wei Wei, Matthias Grundmann, Tingbo Hou
The paper introduces Proximal Reward Difference Prediction (PRDP), a novel method for large-scale reward finetuning of diffusion models. PRDP addresses the instability of existing reinforcement learning (RL)-based methods when training on large-scale prompt datasets, achieving stable black-box reward finetuning for the first time. The key innovation is the Reward Difference Prediction (RDP) objective, a supervised regression objective that tasks the diffusion model with predicting the reward difference of generated image pairs from their denoising trajectories. The RDP objective is proven to have the same optimal solution as the RL objective while offering better training stability. PRDP is evaluated on two large-scale prompt datasets, demonstrating superior generation quality on complex, unseen prompts compared to existing RL-based methods. The paper also introduces online optimization and KL regularization to further improve training stability and generation quality.
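
To make the RDP objective concrete, below is a minimal PyTorch sketch of how such a regression loss could be computed from trajectory log-likelihoods. The tensor shapes, the `beta` reward scale, and the clamp-based "proximal" stabilization of per-step log-ratios are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def rdp_loss(logp_theta_a, logp_ref_a, logp_theta_b, logp_ref_b,
             reward_a, reward_b, beta=1.0, clip_range=1e-4):
    """Sketch of a Reward Difference Prediction (RDP) regression loss.

    logp_*   : per-step log-likelihoods of each denoising trajectory under
               the finetuned (theta) and frozen reference models,
               shape (batch, num_denoising_steps).
    reward_* : black-box rewards of the two generated images, shape (batch,).
    beta, clip_range : hypothetical hyperparameters for the reward scale and
               the proximal clipping of per-step log-ratios (assumed here).
    """
    # Per-step log-likelihood ratios w.r.t. the reference model, clipped to
    # stay within a small trust region (a proximal-style stabilization).
    ratio_a = torch.clamp(logp_theta_a - logp_ref_a, -clip_range, clip_range)
    ratio_b = torch.clamp(logp_theta_b - logp_ref_b, -clip_range, clip_range)

    # The model's predicted reward difference for the image pair is derived
    # from the difference of trajectory-level log-ratios.
    pred_diff = beta * (ratio_a.sum(dim=1) - ratio_b.sum(dim=1))

    # Supervised regression target: the actual reward difference reported by
    # the black-box reward model for the two generated images.
    target_diff = reward_a - reward_b

    return ((pred_diff - target_diff) ** 2).mean()


# Toy usage with random tensors standing in for real trajectory log-probs.
batch, steps = 4, 50
logps = [torch.randn(batch, steps) for _ in range(4)]
rewards = [torch.randn(batch) for _ in range(2)]
loss = rdp_loss(*logps, *rewards)
print(loss.item())
```

In this reading, the diffusion model never receives a policy-gradient signal; it is trained with an ordinary mean-squared error on reward differences, which is what gives the method its supervised-learning-like stability.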