May 2024; Revised July 2024 | Shicong Cen*, Jincheng Mei†, Katayoon Goshvadi, Hanjun Dai, Tong Yang*, Sherry Yang, Dale Schuurmans, Yuejie Chi*, Bo Dai
This paper introduces Value-Incentivized Preference Optimization (VPO), a unified approach to online and offline Reinforcement Learning from Human Feedback (RLHF). VPO regularizes the maximum-likelihood estimate of the reward function with the corresponding value function, modulated by a sign that indicates whether optimism or pessimism is chosen. VPO optimizes the policy directly with implicit reward modeling, simplifying the RLHF pipeline in a manner similar to direct preference optimization (DPO). Theoretical guarantees are provided for both the online and offline settings, matching the rates of standard RL counterparts. Experiments on text summarization and dialog tasks demonstrate the practicality and effectiveness of VPO. The paper also highlights the critical role of reward calibration and discusses the implications for conservative offline RL methods and regularized RLHF methods.
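At a high level, the value-regularized reward estimate has the following shape. This is a minimal sketch based on the description above; the symbols $\ell_{\mathrm{MLE}}$, $\alpha$, and the sign convention are illustrative notation rather than the paper's exact formulation:

$$
\hat{r} \;\in\; \arg\min_{r} \; \ell_{\mathrm{MLE}}(r;\mathcal{D}) \,\mp\, \alpha\, V^{\star}(r),
$$

where $\ell_{\mathrm{MLE}}(r;\mathcal{D})$ is the negative log-likelihood of the observed preference data under a Bradley–Terry model, $V^{\star}(r)$ is the value of the (regularized) optimal policy under reward $r$, $\alpha > 0$ is a regularization weight, and the minus (optimism) or plus (pessimism) sign corresponds to the online or offline setting, respectively. Plugging in an implicit reward of the form $r_\theta = \beta \log(\pi_\theta/\pi_{\mathrm{ref}})$ would then turn this into a direct policy objective in the spirit of DPO, which is how VPO avoids a separate reward-modeling stage.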