May 28, 2024 | Zhihan Liu, Miao Lu, Shenao Zhang, Boyi Liu, Hongyi Guo, Yingxiang Yang, Jose Blanchet, Zhaoran Wang
The paper addresses the issue of reward overoptimization in Reinforcement Learning from Human Feedback (RLHF), which can lead to undesirable responses from large language models (LLMs). The authors propose a theoretical algorithm that minimizes the sum of a maximum likelihood estimation (MLE) loss and a reward penalty term, preventing the policy from choosing actions with spuriously high proxy rewards. This algorithm is shown to be sample-efficient under a partial coverage condition. Its practical implementation, named Regularized Preference Optimization (RPO), combines a preference optimization loss with a supervised fine-tuning (SFT) loss. RPO effectively mitigates overoptimization and achieves better alignment performance than direct preference optimization (DPO), both on in-distribution data and on standard LLM benchmarks such as MT-Bench and AlpacaEval 2.0. The paper provides theoretical guarantees and empirical evidence to support these findings.
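For concreteness, the practical RPO objective can be read as a DPO-style preference loss plus an SFT (imitation) term on the preferred responses. The following is a minimal PyTorch sketch of that combination, not the authors' exact implementation: the function name, the weight `eta`, and the default values are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def rpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps,
             beta=0.1, eta=0.005):
    """Sketch of an RPO-style objective: DPO preference loss plus an
    SFT regularizer on the chosen (preferred) responses.

    Each *_logps tensor holds summed log-probabilities of full responses
    under the trainable policy or the frozen reference model.
    beta and eta are illustrative hyperparameter values (assumptions).
    """
    # Implicit reward margins, as in DPO.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Standard DPO preference loss.
    preference_loss = -F.logsigmoid(chosen_rewards - rejected_rewards)

    # SFT term: maximize likelihood of the chosen responses.
    sft_loss = -policy_chosen_logps

    return (preference_loss + eta * sft_loss).mean()
```

In this reading, the added SFT term keeps the policy anchored to responses that are well supported by the preference data, which is how the regularizer counteracts overoptimization against a spurious proxy reward.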