Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer

May 28, 2024 | Zhihan Liu*, Miao Lu*, Shenao Zhang†, Boyi Liu§, Hongyi Guo†, Yingxiang Yang§, Jose Blanchet†, Zhaoran Wang†
This paper addresses overoptimization in Reinforcement Learning from Human Feedback (RLHF), where an imperfectly learned reward model can misguide the generative model into producing undesired responses. The authors propose a theoretical algorithm that selects the best policy against an adversarially chosen reward model, namely the one minimizing the sum of its maximum likelihood estimation (MLE) loss and a reward penalty term. This construction prevents the policy from exploiting actions with spuriously high proxy rewards and yields provable sample efficiency. The algorithm is then reformulated into an equivalent, easy-to-implement objective that combines a preference optimization loss with a supervised learning loss. The resulting method, Regularized Preference Optimization (RPO), fuses the direct preference optimization (DPO) loss with a supervised fine-tuning (SFT) loss to mitigate overoptimization. Experiments show that RPO outperforms DPO baselines in aligning large language models (LLMs). Together, the theoretical analysis and empirical results demonstrate that RPO mitigates overoptimization by using the SFT loss as an implicit adversarial regularizer, providing both guarantees and evidence for its effectiveness in aligning LLMs.
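To make the fused objective concrete, the practical form described above can be written as $\mathcal{L}_{\text{RPO}}(\theta) = \mathcal{L}_{\text{DPO}}(\theta) + \eta\,\mathbb{E}\big[-\log \pi_\theta(y_w \mid x)\big]$, i.e., the DPO preference loss plus an SFT negative log-likelihood term on the preferred responses. Below is a minimal PyTorch-style sketch of such a loss, assuming per-sequence log-probabilities from the policy and a frozen reference model are already computed; the function name rpo_loss and the weight eta are illustrative, not the authors' implementation.

import torch
import torch.nn.functional as F

def rpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1,
             eta: float = 0.005) -> torch.Tensor:
    """DPO preference loss plus an SFT regularizer on preferred responses (sketch)."""
    # Implicit rewards from the log-probability ratio against the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Standard DPO term: push the chosen response's implicit reward above the rejected one's.
    dpo_term = -F.logsigmoid(chosen_rewards - rejected_rewards)

    # SFT term: negative log-likelihood of the preferred (chosen) responses,
    # acting as the regularizer that discourages drifting toward spurious proxy rewards.
    sft_term = -policy_chosen_logps

    return (dpo_term + eta * sft_term).mean()

In this sketch, eta controls the strength of the SFT regularization: setting it to zero recovers plain DPO, while larger values keep the policy closer to the preferred-response distribution.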