31 May 2024 | Zihuiwen Ye, Fraser Greenlee-Scott, Max Bartolo, Phil Blunsom, Jon Ander Campos, Matthias Gallé
This paper proposes a method to improve reward models (RMs) used in reinforcement learning from human feedback (RLHF) by incorporating synthetic natural language critiques generated by large language models (LLMs). RMs are trained to predict scores that reflect human preferences, but building them traditionally requires extensive human annotation, which is costly and time-consuming. RMs can also overfit to superficial features of the training data, which hurts generalization. To address these issues, the authors generate synthetic critiques that evaluate responses along aspects such as instruction-following, correctness, and style. These critiques provide a richer training signal and more robust features for the RM to base its scores on.
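To make the critique-generation step concrete, below is a minimal sketch of how one completion in a preference pair might be turned into a critique request. The template wording, aspect list, and helper name are illustrative assumptions rather than the paper's exact prompt; in practice the resulting string would be sent to an LLM and its reply stored as the synthetic critique.

```python
# Minimal sketch of building a critique request for one completion in a
# preference pair. The template and function name are illustrative, not the
# authors' exact prompt.
CRITIQUE_TEMPLATE = """You are reviewing a model response to a user prompt.

Prompt:
{prompt}

Response:
{completion}

Write a short critique of the response, covering instruction-following,
correctness, and style. Point out concrete strengths and weaknesses."""


def build_critique_prompt(prompt: str, completion: str) -> str:
    """Fill the critique template for a single (prompt, completion) pair."""
    return CRITIQUE_TEMPLATE.format(prompt=prompt, completion=completion)


if __name__ == "__main__":
    request = build_critique_prompt(
        prompt="Summarise the paper in two sentences.",
        completion="The paper augments reward-model training data with "
                   "LLM-generated critiques and shows this improves accuracy.",
    )
    # In practice this string is sent to an LLM; its reply is the synthetic critique.
    print(request)
```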
The synthetic critiques are produced by prompting LLMs to evaluate each completion in a preference pair; the critiques are then included in the RM's input, and the RM is trained to predict scalar rewards. The results show that high-quality synthetic critiques significantly improve RM performance, especially in low-resource settings: a high-quality model-generated critique is roughly equivalent to 40 vanilla preference pairs, substantially improving data efficiency. Conversely, low-quality critiques hurt performance. Incorporating critiques also makes RM training more interpretable and robust.
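As a rough illustration of how such critique-augmented preference data could be used, here is a minimal PyTorch sketch of reward-model training with the standard Bradley-Terry pairwise loss. The backbone name, the way the critique is concatenated to the input, and the helper functions are assumptions for illustration, not the paper's exact implementation.

```python
# Minimal PyTorch sketch of training a reward model on critique-augmented
# preference pairs with a Bradley-Terry pairwise loss. The backbone, the input
# format, and the helper names are illustrative assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "distilbert-base-uncased"  # placeholder backbone with a scalar head
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
reward_model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)


def score(prompt: str, completion: str, critique: str) -> torch.Tensor:
    """Scalar reward for one completion, with its synthetic critique appended."""
    text = f"{prompt}\n\n{completion}\n\nCritique: {critique}"
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    return reward_model(**inputs).logits.squeeze(-1)  # shape: (1,)


def pairwise_loss(prompt, chosen, rejected, critique_chosen, critique_rejected):
    """Bradley-Terry loss: push the chosen reward above the rejected reward."""
    r_chosen = score(prompt, chosen, critique_chosen)
    r_rejected = score(prompt, rejected, critique_rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()


optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-5)
loss = pairwise_loss(
    prompt="Explain RLHF in one sentence.",
    chosen="RLHF fine-tunes a model with a reward signal learned from human preferences.",
    rejected="RLHF is a kind of database index.",
    critique_chosen="Accurate and concise; follows the one-sentence instruction.",
    critique_rejected="Factually wrong; does not address the question.",
)
loss.backward()
optimizer.step()
```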
The experiments demonstrate that synthetic critiques improve RM performance and data efficiency across RMs trained from different pretrained models. The method is accessible and cost-effective, since the critiques can be generated efficiently with open-source models. The study also shows that critiques help weaker checkpoints the most, and that the benefit grows with critique quality. Overall, synthetic critiques can significantly enhance RM performance, particularly on challenging tasks such as reasoning and safety, and have the potential to make reward-model training more efficient while reducing reliance on costly human annotation.