PRDP: Proximal Reward Difference Prediction for Large-Scale Reward Finetuning of Diffusion Models
This paper introduces PRDP, a novel method for large-scale reward finetuning of diffusion models. PRDP enables stable black-box reward maximization for diffusion models on large-scale prompt datasets with over 100K prompts. The key innovation is the Reward Difference Prediction (RDP) objective, which has the same optimal solution as the RL objective while enjoying better training stability. Specifically, the RDP objective is a supervised regression objective that tasks the diffusion model with predicting the reward difference of generated image pairs from their denoising trajectories. We theoretically prove that the diffusion model that achieves perfect reward difference prediction is exactly the maximizer of the RL objective. We further develop an online algorithm with proximal updates to stably optimize the RDP objective. In experiments, we demonstrate that PRDP matches the reward maximization ability of well-established RL-based methods in small-scale training. Furthermore, through large-scale training on text prompts from the Human Preference Dataset v2 and the Pick-a-Pic v1 dataset, PRDP achieves superior generation quality on a diverse set of complex, unseen prompts, whereas RL-based methods completely fail.
PRDP is based on the idea of converting the RLHF objective into a supervised regression objective, which allows for stable training on large-scale prompt datasets. The method is inspired by the success of DPO in language models, which converts the RLHF objective into a supervised classification objective. For diffusion models, we derive a new supervised regression objective, called Reward Difference Prediction (RDP), that has the same optimal solution as the RLHF objective while enjoying better training stability. Specifically, our RDP objective tasks the diffusion model with predicting the reward difference of generated image pairs from their denoising trajectories. We prove that the diffusion model that achieves perfect reward difference prediction is exactly the maximizer of the RLHF objective. We further propose proximal updates and online optimization to improve training stability and generation quality.
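To make the regression objective concrete, the following is a minimal PyTorch sketch of a reward-difference-prediction loss. It is an illustrative reconstruction from the description above, not the paper's exact formulation: it assumes a DPO-style parameterization in which the model's implicit reward of a denoising trajectory is beta times its log-probability ratio against a frozen reference model, and it uses a simple clamp on the log-ratios as a stand-in for the proposed proximal update. The function name `rdp_loss`, all tensor names, and the `beta` and `clip` values are hypothetical.

```python
# Illustrative sketch only: squared error between the model-predicted reward difference
# of an image pair (generated from the same prompt) and the true reward difference
# given by a black-box reward model. Hyperparameter values are placeholders.
import torch

def rdp_loss(logp_theta_a: torch.Tensor,  # log-prob of trajectory A under the finetuned model
             logp_ref_a: torch.Tensor,    # log-prob of trajectory A under the frozen reference model
             logp_theta_b: torch.Tensor,  # log-prob of trajectory B under the finetuned model
             logp_ref_b: torch.Tensor,    # log-prob of trajectory B under the frozen reference model
             reward_a: torch.Tensor,      # black-box reward of image A
             reward_b: torch.Tensor,      # black-box reward of image B
             beta: float = 0.1,           # KL-regularization strength (illustrative value)
             clip: float = 0.1) -> torch.Tensor:  # clipping range standing in for the proximal update
    # Log-ratios of the finetuned vs. reference model, clamped to keep updates close to the reference.
    ratio_a = torch.clamp(logp_theta_a - logp_ref_a, -clip, clip)
    ratio_b = torch.clamp(logp_theta_b - logp_ref_b, -clip, clip)
    # Model-predicted reward difference vs. ground-truth reward difference.
    predicted_diff = beta * (ratio_a - ratio_b)
    target_diff = reward_a - reward_b
    return torch.mean((predicted_diff - target_diff) ** 2)

# Toy usage with random stand-in values for a batch of 4 image pairs.
if __name__ == "__main__":
    b = 4
    loss = rdp_loss(torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b),
                    torch.rand(b), torch.rand(b))
    print(loss.item())
```

In an online setup, the image pairs and their rewards would be regenerated periodically from the current model, so the regression targets track the model's own samples rather than a fixed offline dataset.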
Our contributions are summarized as follows:
• We propose PRDP, a scalable reward finetuning method for diffusion models, with a new reward difference prediction objective and its stable optimization algorithm.
• PRDP achieves, for the first time, stable black-box reward maximization for diffusion models on large-scale prompt datasets with over 100K prompts.
• PRDP exhibits superior generation quality and generalization to unseen prompts through large-scale training.