12 Jan 2024 | Binghai Wang*, Rui Zheng*, Lu Chen*, Yan Liu*, Shihan Dou, Caishuang Huang, Wei Shen, Senjie Jin, Enyu Zhou, Chenyu Shi, Songyang Gao, Nuo Xu, Yuhao Zhou, Xiaoran Fan, Zhiheng Xi, Jun Zhao, Xiao Wang, Tao Ji, Hang Yan, Lixing Shen*, Zhan Chen*, Tao Gui†, Qi Zhang†, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, Yu-Gang Jiang
This paper explores the challenges and solutions in reward modeling for Reinforcement Learning from Human Feedback (RLHF) in large language models. Reward models are crucial for aligning language models with human preferences, but they face challenges such as noisy and ambiguous preference data, and poor generalization to out-of-distribution (OOD) examples. The authors propose methods to address these issues.
First, they introduce a preference strength measurement metric based on multi-reward model voting. This metric helps identify and mitigate the impact of incorrect and ambiguous preferences in the dataset. They also introduce an adaptive margin in the loss function to better distinguish between similar responses.
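To make the idea concrete, here is a minimal PyTorch sketch of an ensemble-voting preference strength signal and a margin-augmented ranking loss. The helper names (`preference_strength`, `ranking_loss_with_margin`), the ensemble interface, and the way the margin is scaled from the strength estimate are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def preference_strength(reward_models, chosen_batch, rejected_batch):
    """Vote across an ensemble of reward models to estimate preference strength.

    Each model in `reward_models` is assumed to map a tokenized batch of
    (prompt, response) pairs to a 1-D tensor of scalar rewards. The mean
    reward gap approximates preference strength; the standard deviation
    flags ambiguous or likely-incorrect pairs.
    """
    gaps = torch.stack([rm(chosen_batch) - rm(rejected_batch) for rm in reward_models])  # (M, B)
    return gaps.mean(dim=0), gaps.std(dim=0)

def ranking_loss_with_margin(r_chosen, r_rejected, strength, scale=1.0):
    """Bradley-Terry ranking loss with an adaptive margin.

    The margin grows with the estimated preference strength, so the model is
    pushed to separate clearly-better responses further than near-ties.
    """
    margin = scale * strength
    return -F.logsigmoid(r_chosen - r_rejected - margin).mean()
```

The design intuition is that a large, low-variance reward gap across the ensemble signals a clear preference that deserves a wide margin, while a small or high-variance gap signals ambiguity that should not be enforced aggressively.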
Second, they explore contrastive learning and meta-learning to enhance the reward model's ability to generalize. Contrastive learning helps the model distinguish between chosen and rejected responses, while meta-learning helps it retain the ability to discern subtle differences in OOD samples.
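As an illustration of the contrastive component, the sketch below uses a SimCSE-style objective on two dropout-perturbed views of the reward model's response embeddings; the function name, temperature, and pairing scheme are assumptions, and the paper's exact contrastive variant may differ.

```python
import torch
import torch.nn.functional as F

def simcse_style_contrastive_loss(view1, view2, temperature=0.05):
    """SimCSE-style contrastive loss on reward-model response embeddings.

    `view1` and `view2` are two forward passes of the same responses through
    the reward model's encoder with dropout active, so row i of `view1` has
    its positive at row i of `view2`, and every other row acts as a negative.
    """
    z1 = F.normalize(view1, dim=-1)           # (B, D)
    z2 = F.normalize(view2, dim=-1)           # (B, D)
    logits = z1 @ z2.T / temperature          # (B, B) cosine similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)    # diagonal entries are the positives
```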
The authors also demonstrate that these methods lead to more stable reinforcement learning and improved alignment performance. They open-source their training code, Anthropic's HH-RLHF dataset annotated with preference strength information, and a validation set cleaned by GPT-4.
The paper shows that preference data with varying strengths can significantly impact reward model performance. They propose methods to handle incorrect and ambiguous data, including label flipping and label smoothing. They also introduce an adaptive margin to improve the model's ability to distinguish between responses.
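A hedged sketch of how label flipping and label smoothing could plug into the ranking loss, reusing the ensemble-based preference strength signal from above; the flipping threshold, the smoothing form, and the function name are illustrative assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def ranking_loss_with_cleanup(r_chosen, r_rejected, strength,
                              flip_threshold=0.0, smooth_eps=0.1):
    """Ranking loss with label flipping for incorrect pairs and label smoothing.

    Pairs whose ensemble-estimated preference strength falls below
    `flip_threshold` (i.e. the "rejected" response is judged better) have
    their comparison flipped; all pairs are then trained against a smoothed
    target so the model is not forced to be overconfident on near-ties.
    """
    diff = r_chosen - r_rejected
    diff = torch.where(strength < flip_threshold, -diff, diff)   # flip likely mislabeled pairs
    prob = torch.sigmoid(diff)                                    # P(chosen preferred)
    target = torch.full_like(prob, 1.0 - smooth_eps)              # smoothed "chosen wins" label
    return F.binary_cross_entropy(prob, target)
```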
The authors evaluate their methods on various datasets and find that their approach outperforms baselines in alignment performance and generalization. They also show that their method is effective in both in-distribution and out-of-distribution scenarios, with the latter showing a slight performance decline due to distribution shift.
Overall, the paper presents a comprehensive approach to improving reward modeling in RLHF, addressing key challenges in preference data and enhancing the model's generalization capabilities.