This paper introduces Deep Reward Tuning (DRTune), a method for training text-to-image diffusion models with differentiable reward functions. DRTune supervises the final output image directly and back-propagates through the iterative sampling process to the input noise. It resolves the depth-efficiency dilemma by stopping the gradient at the denoising network's input and training only a subset of sampling steps. Evaluated across a variety of reward models, DRTune consistently outperforms competing algorithms, particularly for low-level control signals.

The method is applied to fine-tune Stable Diffusion XL 1.0 (SDXL 1.0) to optimize Human Preference Score v2.1, yielding Favorable Diffusion XL 1.0 (FDXL 1.0). FDXL 1.0 substantially improves image quality over SDXL 1.0 and reaches quality comparable to Midjourney v5.2. The contributions of this work are twofold: DRTune, which efficiently supervises early denoising steps, and FDXL 1.0, a state-of-the-art open-source text-to-image generative model tuned on human preferences. The paper also reviews related work, analyzes the challenges of training diffusion models with reward functions, and demonstrates DRTune's effectiveness in improving image quality and convergence: experiments show that DRTune surpasses other reward-training methods in both image quality and convergence speed. The paper concludes that DRTune is an effective method for reward-based training of text-to-image diffusion models and highlights the broader potential of reward training for improving image generation quality.
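To make the gradient-stopping idea described above concrete, the following is a minimal PyTorch-style sketch, not the paper's implementation: the denoiser, the DDIM-like update coefficients (`alphas`, `sigmas`), the reward function, and the `train_steps` selection are all illustrative placeholders. The sketch shows the two ingredients summarized here: the gradient is stopped at every denoiser input, while the linear part of each sampling update keeps a differentiable path from the reward on the final sample back to the input noise, and only a chosen subset of steps is run with gradients enabled.

```python
import torch
import torch.nn as nn

def drtune_loss(denoiser, reward_fn, cond, noise, alphas, sigmas, train_steps):
    """One DRTune-style forward pass through a simplified DDIM-like sampler.

    The reward on the final sample is back-propagated to the input noise,
    the gradient is stopped at every denoiser *input*, and only the steps
    listed in `train_steps` are run with gradients enabled.
    """
    x = noise
    for i in range(len(alphas)):
        x_in = x.detach()  # stop-gradient on the denoiser input
        if i in train_steps:
            # Trainable step: gradients reach the network through eps only.
            eps = denoiser(x_in, i, cond)
        else:
            with torch.no_grad():  # frozen step: no gradient contribution
                eps = denoiser(x_in, i, cond)
        # The linear update keeps a differentiable path from the final
        # sample back to every step's prediction and to the input noise.
        x = alphas[i] * x + sigmas[i] * eps
    return -reward_fn(x, cond)  # minimize the negative reward


# Toy usage: a tiny MLP stands in for the denoiser, and an arbitrary
# differentiable function stands in for the reward model.
denoiser = nn.Sequential(nn.Linear(8, 64), nn.SiLU(), nn.Linear(64, 8))
reward = lambda x, c: -(x - 1.0).pow(2).mean()
alphas = torch.linspace(0.95, 0.99, 20)
sigmas = 1.0 - alphas
loss = drtune_loss(lambda x, t, c: denoiser(x), reward, None,
                   torch.randn(4, 8), alphas, sigmas, train_steps={15, 17, 19})
loss.backward()  # gradients flow only into the selected steps' computations
```

Because the untrained steps run under `torch.no_grad()` and every denoiser input is detached, memory and compute for back-propagation scale with the number of trained steps rather than the full sampling depth, which is the efficiency side of the depth-efficiency trade-off the summary refers to.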