22 Jan 2024 | Alizée Pace, Jonathan Mallinson, Eric Malmi, Sebastian Krause, Aliaksei Severyn
The paper presents a novel approach to improving the quality of reward models used in Reinforcement Learning from Human Feedback (RLHF) for language model alignment. The approach, called West-of-N sampling, generates synthetic preference data by selecting the best and worst responses, as judged by a base reward model, from a pool of $N$ responses sampled for a given query. This self-training strategy augments the training dataset with high-quality, on-policy preference pairs, improving reward model performance. Empirical results show that West-of-N sampling yields gains comparable to or greater than those from adding a similar amount of human preference data. The method is effective across different datasets and initial preference data types, demonstrating the potential of Best-of-N sampling and semi-supervised learning for reward model training. The paper also discusses theoretical guarantees and the mechanisms behind West-of-N's effectiveness, offering insights into the quality and distribution of the generated preference data.
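To make the sampling procedure concrete, here is a minimal Python sketch of how West-of-N pseudo-preference pairs could be constructed. The `generate` and `score` callables are hypothetical placeholders standing in for the policy's sampler and the base reward model; they are not part of the paper's released code, and the paper's full pipeline (e.g. any filtering of low-confidence pairs) is not reproduced here.

```python
from typing import Callable, List, Tuple


def west_of_n_pairs(
    queries: List[str],
    generate: Callable[[str, int], List[str]],  # hypothetical: samples n responses from the policy
    score: Callable[[str, str], float],         # hypothetical: base reward model score for (query, response)
    n: int = 8,
) -> List[Tuple[str, str, str]]:
    """Build synthetic (query, preferred, dispreferred) triples via West-of-N sampling."""
    pairs = []
    for query in queries:
        # Sample a pool of N on-policy candidate responses for this query.
        candidates = generate(query, n)
        # Rank candidates with the current (base) reward model.
        ranked = sorted(candidates, key=lambda r: score(query, r))
        # Pair the Best-of-N response with the Worst-of-N response
        # to form one self-training preference example.
        worst, best = ranked[0], ranked[-1]
        pairs.append((query, best, worst))
    return pairs
```

Under this sketch, the resulting pairs would simply be appended to the human preference data before retraining the reward model, which is the semi-supervised loop the summary describes.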