Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF


May 2024; Revised July 2024 | Shicong Cen*, Jincheng Mei†, Katayoon Goshvadi, Hanjun Dai, Tong Yang*, Sherry Yang, Dale Schuurmans, Yuejie Chi*, Bo Dai
This paper introduces a unified approach to online and offline reinforcement learning from human feedback (RLHF), called value-incentivized preference optimization (VPO). VPO regularizes the maximum-likelihood estimate of the reward function with the corresponding value function, modulated by a sign that indicates whether optimism or pessimism is chosen. VPO directly optimizes the policy with implicit reward modeling, sharing a simpler RLHF pipeline similar to direct preference optimization (DPO). Theoretical guarantees are provided for VPO in both online and offline settings, matching the rates of their standard RL counterparts. Experiments on text summarization and dialog verify the practicality and effectiveness of VPO. The paper discusses the challenges of incorporating reward uncertainty into direct preference optimization when policies are parameterized by large-scale neural networks, and highlights the importance of reward calibration. It also compares VPO with existing methods such as DPO and IPO, showing that VPO achieves better performance and is more robust to reward over-optimization. The paper concludes that VPO offers a practical and theoretically grounded way to achieve principled optimism and pessimism in RLHF.
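
In symbols, the value-regularized reward estimate described above can be written schematically as follows. This is a minimal formalization, assuming a standard preference log-likelihood term ℓ_MLE over the preference data D and a KL-regularized value function toward a reference policy π_ref; the weight α is a generic regularization coefficient:

    \hat{r} \in \arg\min_{r} \; \ell_{\mathrm{MLE}}(r; \mathcal{D}) \;\mp\; \alpha \, V^{\star}(r),
    \qquad
    V^{\star}(r) = \max_{\pi} \; \mathbb{E}_{x,\, y \sim \pi(\cdot \mid x)}\!\big[ r(x,y) \big] \;-\; \beta \, \mathrm{KL}\big( \pi \,\|\, \pi_{\mathrm{ref}} \big),

where the "-" sign favors reward estimates with high value (optimism, online setting) and the "+" sign penalizes them (pessimism, offline setting).

For the DPO-style implicit-reward pipeline, the PyTorch-style sketch below illustrates one way such a loss could be assembled. It is a sketch under stated assumptions, not the paper's exact objective: the implicit reward r_θ = β log(π_θ/π_ref) follows DPO, the value-incentive term below simply uses the mean implicit reward of the preferred responses as a stand-in for the value regularizer, and alpha and beta are placeholder hyperparameters.

    import torch
    import torch.nn.functional as F

    def vpo_style_loss(policy_logps_w, policy_logps_l,
                       ref_logps_w, ref_logps_l,
                       beta=0.1, alpha=0.01, optimistic=True):
        """DPO-style preference loss plus a signed value-incentive term (illustrative).

        policy_logps_* / ref_logps_*: summed log-probabilities of the chosen (w) and
        rejected (l) responses under the current policy and the frozen reference policy.
        """
        # Implicit rewards r_theta = beta * log(pi_theta / pi_ref), as in DPO.
        reward_w = beta * (policy_logps_w - ref_logps_w)
        reward_l = beta * (policy_logps_l - ref_logps_l)

        # Bradley-Terry preference loss on the implicit reward margin.
        preference_loss = -F.logsigmoid(reward_w - reward_l).mean()

        # Signed value-incentive term: subtracted for optimism (online),
        # added for pessimism (offline). Placeholder for the paper's value regularizer.
        value_term = reward_w.mean()
        sign = -1.0 if optimistic else 1.0
        return preference_loss + sign * alpha * value_term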