31 May 2024 | Zihuiwen Ye*, Fraser Greenlee-Scott, Max Bartolo, Phil Blunsom, Jon Ander Campos, Matthias Gallé
The paper proposes an approach to improving reward models (RMs) in reinforcement learning from human feedback (RLHF) by incorporating synthetic natural language critiques generated by large language models (LLMs). RMs are central to RLHF because they predict scores reflecting human preferences, but they often overfit to superficial features and require substantial human annotation. In the proposed method, an LLM is prompted to generate a critique for each prompt-completion pair in the training data, assessing aspects such as instruction following, correctness, and style. These critiques are then included in the RM's training input, and the RM is evaluated on test sets augmented with critiques in the same way. Experiments show that high-quality critiques improve RM performance, especially in low-resource settings, and that the gains are larger for weaker pretrained models. The approach also improves data efficiency and interpretability, making it a cost-effective way to obtain competitive reward models.
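As a rough illustration of the idea (not the authors' exact implementation), the sketch below shows how a preference dataset might be augmented with LLM-generated critiques before reward model training. Here `generate_critique` stands in for whatever LLM call produces the critique, and the prompt template, dataset fields, and concatenation format are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical prompt template asking an LLM to critique a completion along the
# dimensions mentioned in the paper: instruction following, correctness, style.
CRITIQUE_TEMPLATE = (
    "Evaluate the following response to the prompt.\n"
    "Comment on instruction following, correctness, and style.\n\n"
    "Prompt: {prompt}\n"
    "Response: {completion}\n"
    "Critique:"
)

@dataclass
class PreferencePair:
    prompt: str
    chosen: str     # completion preferred by the annotator
    rejected: str   # completion not preferred

def augment_with_critiques(
    pairs: List[PreferencePair],
    generate_critique: Callable[[str], str],  # assumed LLM call: prompt -> critique text
) -> List[dict]:
    """Attach an LLM-written critique to each completion, so the reward model
    is trained on (prompt, completion, critique) rather than (prompt, completion)."""
    augmented = []
    for pair in pairs:
        chosen_critique = generate_critique(
            CRITIQUE_TEMPLATE.format(prompt=pair.prompt, completion=pair.chosen)
        )
        rejected_critique = generate_critique(
            CRITIQUE_TEMPLATE.format(prompt=pair.prompt, completion=pair.rejected)
        )
        augmented.append(
            {
                # The critique is simply concatenated into the RM input text here;
                # the exact formatting used in the paper may differ.
                "chosen_text": f"{pair.prompt}\n{pair.chosen}\nCritique: {chosen_critique}",
                "rejected_text": f"{pair.prompt}\n{pair.rejected}\nCritique: {rejected_critique}",
            }
        )
    return augmented
```

The reward model would then be trained on these augmented texts with a standard pairwise preference loss, and test examples would be augmented the same way before scoring, matching the paper's setup of evaluating on critique-augmented test sets.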