16 Mar 2024 | Jiachen Li, Weixi Feng, Wenhu Chen, William Yang Wang
Reward Guided Latent Consistency Distillation (RG-LCD) is a novel approach that improves both the efficiency and the quality of text-to-image synthesis by integrating feedback from a reward model (RM) into the latent consistency distillation (LCD) process. LCD distills a latent consistency model (LCM) from a pre-trained latent diffusion model (LDM), enabling high-fidelity image generation in just 2-4 inference steps, but this efficiency typically comes at the cost of sample quality. RG-LCD addresses the trade-off by aligning the LCM with human preferences during training: the distillation objective is augmented with a term that maximizes the RM's reward for the LCM's single-step generation. The result is a 25-fold inference speedup with no perceived loss in quality, as validated by human evaluations.
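To make the combined objective concrete, here is a minimal PyTorch sketch of the idea, assuming a differentiable RM. The names (`student_pred`, `ema_target`, `reward_model`, `beta`) are illustrative, and the paper's actual distance function and weighting may differ; this is a sketch of the technique, not the authors' implementation.

```python
import torch.nn.functional as F

def rg_lcd_loss(student_pred, ema_target, decoded_image, prompt_emb,
                reward_model, beta=1.0):
    # Standard LCD consistency term: the student's single-step prediction
    # should match the EMA teacher's target for the adjacent timestep.
    consistency = F.mse_loss(student_pred, ema_target.detach())
    # Reward term: score the decoded single-step sample with the RM.
    # Subtracting it from the loss performs gradient ascent on the reward.
    reward = reward_model(decoded_image, prompt_emb).mean()
    return consistency - beta * reward
```

Because the reward term is attached to the single-step generation, the RM's preference signal shapes exactly the fast sampling regime the LCM is meant to serve.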
To mitigate the risk of reward over-optimization, RG-LCD introduces a latent proxy RM (LRM) that sits between the LCM and the expert RM. The LCM is optimized against the LRM rather than the RM directly, which avoids the high-frequency noise that direct reward maximization can introduce into generated images and improves automatic metrics such as FID on MS-COCO and HPSv2.1 scores. The LRM is pretrained and then fine-tuned so that its preferences match those of the expert RM; because the expert only needs to supply preference scores rather than gradients, this design also lets RG-LCD learn from a wider range of reward models, including non-differentiable ones.
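The following is a minimal sketch of that fine-tuning step, assuming a pairwise Bradley-Terry-style preference objective; the function and argument names (`lrm`, `expert_rm`, `decoder`) are hypothetical stand-ins, not the paper's API.

```python
import torch
import torch.nn.functional as F

def lrm_alignment_loss(lrm, expert_rm, decoder, z_a, z_b, prompt_emb):
    with torch.no_grad():
        # Expert preference over the decoded image pair. Only scalar scores
        # are needed, never gradients, so the expert RM may be
        # non-differentiable (or even a black box).
        s_a = expert_rm(decoder(z_a), prompt_emb)
        s_b = expert_rm(decoder(z_b), prompt_emb)
        p_expert = torch.sigmoid(s_a - s_b)
    # The LRM predicts the same preference directly in latent space.
    p_lrm = torch.sigmoid(lrm(z_a, prompt_emb) - lrm(z_b, prompt_emb))
    # Match the LRM's preference distribution to the expert's.
    return F.binary_cross_entropy(p_lrm, p_expert)
```

At distillation time, gradients then flow only through the LRM in latent space, so the LCM never chases the expert RM's pixel-level gradient artifacts directly.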
RG-LCD is evaluated with both human feedback and automatic metrics, outperforming baseline methods on both. Human evaluators prefer 2-step generations from RG-LCMs over 50-step generations from the teacher LDM, which corresponds to the 25-fold inference speedup. Automatic metrics agree: RG-LCMs achieve higher HPSv2.1 scores and lower FID on MS-COCO, and integrating the LRM into the RG-LCD process improves these results further by suppressing high-frequency noise and raising overall sample quality.
The key contributions of RG-LCD include the introduction of a framework that incorporates RM feedback into the LCD process, the development of an LRM to enable indirect RM optimization, and a significant inference speedup without compromising sample quality. These advancements make RG-LCD a promising approach for efficient and high-quality text-to-image synthesis.