22 Jan 2024 | Alizée Pace, Jonathan Mallinson, Eric Malmi, Sebastian Krause, Aliaksei Severyn
This paper introduces West-of-N, a method for generating synthetic preference data to improve reward models in reinforcement learning from human feedback (RLHF). The approach builds on Best-of-N sampling, a strategy commonly used in language model generation, to produce high-quality, on-policy preference pairs: for each query, a pool of candidate responses is sampled from the policy, and the best and worst responses in the pool are paired to form a synthetic preference example for reward model training. This method improves reward model performance, with gains comparable to or better than those from adding human preference data. The work also demonstrates the potential of self-training and semi-supervised learning for reward modeling, opening new avenues for improving language model alignment. Experiments across multiple datasets show that West-of-N significantly enhances reward models whether they are initially trained on human feedback or on synthetic data, highlighting the importance of on-policy preference data and the benefits of self-training for improving the quality of preference labels. The paper also discusses theoretical guarantees of the West-of-N approach and provides insights into the mechanisms behind its effectiveness. Overall, the study contributes to the field of language model alignment by offering a promising solution to the challenges of reward modeling.
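To make the pairing step concrete, here is a minimal sketch of how a West-of-N style preference pair could be constructed. It assumes two stand-in callables not defined in the summary: `policy_sample(prompt)`, which draws one on-policy response, and `reward_score(prompt, response)`, which scores a response with a base preference/reward model (consistent with the self-training setup described above). This is an illustrative sketch, not the authors' implementation.

```python
def west_of_n_pair(prompt, policy_sample, reward_score, n=16):
    """Build one synthetic preference pair for `prompt` via Best/Worst-of-N selection.

    `policy_sample` and `reward_score` are hypothetical stand-ins for the
    policy model and a base reward model used to rank candidates.
    """
    # Sample N on-policy candidate responses for the prompt.
    candidates = [policy_sample(prompt) for _ in range(n)]

    # Score each candidate with the base reward model.
    scored = [(reward_score(prompt, c), c) for c in candidates]

    # The highest-scoring response is treated as "chosen" (Best-of-N),
    # the lowest-scoring as "rejected" (Worst-of-N).
    _, chosen = max(scored, key=lambda sc: sc[0])
    _, rejected = min(scored, key=lambda sc: sc[0])

    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```

Pairs produced this way would then be added to the reward model's training set alongside (or in place of) human-labeled preferences, which is the self-training loop the summary refers to.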