29 Jan 2024 | Banghua Zhu, Michael I. Jordan, and Jiantao Jiao
This paper addresses two issues in Reinforcement Learning from Human Feedback (RLHF): reward overfitting and reward overoptimization. RLHF aligns language models with human values by first fitting a reward model to human preference data and then fine-tuning the language model against that reward. In practice, however, the reward model's held-out performance degrades after roughly one epoch of training (overfitting), and optimizing the policy too aggressively against the learned reward yields suboptimal policies (overoptimization).
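For concreteness, the reward-learning step is typically implemented as a pairwise (Bradley-Terry) cross-entropy loss over comparison data. The sketch below is a generic illustration of that loss, not code from the paper; the function name and example scores are hypothetical.

```python
import numpy as np

def pairwise_reward_loss(r_chosen, r_rejected):
    """Standard Bradley-Terry cross-entropy loss for reward learning:
    mean of -log sigmoid(r_chosen - r_rejected) over all comparisons."""
    margin = np.asarray(r_chosen, dtype=float) - np.asarray(r_rejected, dtype=float)
    return float(np.mean(np.log1p(np.exp(-margin))))

# Hypothetical reward scores assigned to preferred / dispreferred responses.
print(pairwise_reward_loss([1.2, 0.3], [0.4, 0.9]))
```

Minimizing this loss with hard 0/1 preference labels is the baseline that the paper's method modifies.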
The authors propose an improved reward-learning algorithm called Iterative Data Smoothing (IDS). The core idea is to update the model with the data while also updating the data with the model, replacing hard preference labels with soft labels that reflect the model's current predictions. This mitigates both reward overfitting and overoptimization by reducing the influence of imbalanced comparison data and preventing the model from memorizing the training labels.
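A minimal sketch of this idea in a K-armed bandit setting is shown below: each epoch alternates a gradient step on the reward estimates with a step that moves each label toward the model's predicted preference probability. The function name, learning rates, and the convex-combination label update are illustrative assumptions, not the paper's exact formulation or hyperparameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def iterative_data_smoothing(pairs, labels, K, epochs=200, lr_theta=0.1, lr_label=0.05):
    """Sketch of the IDS idea for a K-armed bandit.
    pairs[i] = (a, b) is a compared arm pair; labels[i] is the (initially hard)
    probability that arm a was preferred. Each epoch: (1) gradient step on the
    reward estimates theta under the current soft labels, (2) move the labels
    toward the model's own predicted preference probabilities."""
    theta = np.zeros(K)
    y = np.asarray(labels, dtype=float)
    for _ in range(epochs):
        # (1) model update: gradient of the soft-label cross-entropy w.r.t. theta
        grad = np.zeros(K)
        for (a, b), yi in zip(pairs, y):
            p = sigmoid(theta[a] - theta[b])
            grad[a] += (p - yi)
            grad[b] -= (p - yi)
        theta -= lr_theta * grad / len(pairs)
        # (2) data update: smooth the labels toward the model's predictions
        preds = np.array([sigmoid(theta[a] - theta[b]) for a, b in pairs])
        y = (1 - lr_label) * y + lr_label * preds
    return theta, y

# Example usage: 3 arms, two comparisons, hard initial labels.
theta, soft_labels = iterative_data_smoothing([(0, 1), (2, 1)], [1.0, 0.0], K=3)
```

Setting `lr_label = 0` recovers ordinary hard-label training, which makes the smoothing step easy to ablate.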
The paper analyzes the problem in a multi-armed bandit framework, showing that minimizing the standard empirical cross-entropy loss overfits when the comparison data is imbalanced: arms that appear in only a few comparisons can receive badly inflated (or deflated) reward estimates. IDS addresses this by iteratively updating the preference labels using the model's predictions, which helps the learned rewards converge toward the true ones.
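To see the failure mode concretely, consider a single pair of arms compared exactly once, with the winner recorded as a hard label of 1: the empirical cross-entropy is driven down only as the estimated reward gap diverges, whereas smoothing the label toward the model's own prediction makes the gap plateau. The toy simulation below (with assumed learning rates, in the spirit of the sketch above) illustrates the contrast.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_gap(epochs=2000, lr_theta=0.1, lr_label=0.0):
    """Fit the reward gap for one arm pair compared once, where the recorded
    (hard) label says the first arm won. lr_label = 0 is plain hard-label
    training; lr_label > 0 smooths the label toward the model's prediction,
    in the spirit of IDS."""
    gap, y = 0.0, 1.0                          # estimated reward gap, current label
    for _ in range(epochs):
        p = sigmoid(gap)                       # predicted preference probability
        gap += lr_theta * (y - p)              # gradient step on the reward gap
        y = (1 - lr_label) * y + lr_label * p  # label update (no-op when lr_label = 0)
    return gap

print("hard labels:      gap =", round(fit_gap(lr_label=0.0), 2))   # keeps growing with more epochs
print("IDS-style labels: gap =", round(fit_gap(lr_label=0.05), 2))  # plateaus at a modest value
```

Under hard labels the greedy policy can therefore latch onto a rarely compared arm; smoothing the label keeps the estimate for such arms bounded.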
Empirical results show that IDS outperforms standard hard-label reward training in both the bandit and neural-network settings. By adjusting the preference labels during training, the algorithm mitigates overfitting and overoptimization, leading to better policy learning. The paper also discusses related work, including knowledge distillation and preference-based reinforcement learning, and highlights the importance of soft labels for improving model performance. Overall, the method is shown to be effective in both theoretical and practical settings, providing a robust remedy for these challenges in RLHF.