Dataset Reset Policy Optimization for RLHF


16 Apr 2024 | Jonathan D. Chang, Wenhao Zhan, Owen Oertell, Kianté Brantley, Dipendra Misra, Jason D. Lee, Wen Sun
The paper introduces Dataset Reset Policy Optimization (DR-PO), a novel algorithm for Reinforcement Learning from Human Feedback (RLHF) that leverages the ability to reset the system to states in an offline preference dataset. DR-PO integrates offline data into online policy training by resetting the policy optimizer to states drawn from the offline dataset rather than always starting rollouts from the initial state distribution. The authors theoretically show that, under general function approximation and with finite sample complexity, DR-PO learns a policy that performs at least as well as any policy covered by the offline dataset. Empirical results on the TL;DR summarization and Anthropic Helpful and Harmless (HH) datasets show that DR-PO outperforms Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO) in GPT-4 win rate, indicating its effectiveness at optimizing the reward model while producing high-quality generations. The key contributions of the work are the theoretical guarantees of DR-PO and its superior empirical performance over existing RLHF algorithms.
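To make the dataset-reset idea concrete, below is a minimal sketch of one training iteration, not the authors' implementation: it resets rollouts to intermediate states of offline preference-dataset trajectories (here, token prefixes) before letting the current policy continue. All names (offline_dataset, policy_generate, reward_model, policy_update) are hypothetical placeholders standing in for a real language model, learned reward model, and policy-gradient update such as PPO.

```python
import random

# --- Hypothetical stand-ins (not from the paper's codebase) ----------------
# offline_dataset: token sequences (prompt + preferred completion) from the
#                  preference data; policy_generate continues a prefix under
#                  the current policy; reward_model scores a full sequence;
#                  policy_update is any policy-gradient step on the rollouts.

offline_dataset = [
    ["<prompt>", "the", "preferred", "completion", "tokens"],
    ["<prompt>", "another", "good", "summary"],
]

def policy_generate(prefix, max_new_tokens=5):
    # Placeholder generator; a real implementation samples from the policy LM.
    return prefix + ["<gen>"] * max_new_tokens

def reward_model(sequence):
    # Placeholder reward; a real implementation uses a learned reward model.
    return float(len(sequence))

def policy_update(rollouts):
    # Placeholder for a policy-gradient / PPO-style update on (sequence, reward).
    pass

def drpo_iteration(num_rollouts=4):
    rollouts = []
    for _ in range(num_rollouts):
        # Dataset reset: pick an offline trajectory and a random timestep,
        # then roll out the current policy from that intermediate state
        # instead of from the initial state (the bare prompt).
        traj = random.choice(offline_dataset)
        t = random.randrange(1, len(traj))       # reset point
        prefix = traj[:t]                        # state taken from offline data
        full_sequence = policy_generate(prefix)  # on-policy continuation
        rollouts.append((full_sequence, reward_model(full_sequence)))
    policy_update(rollouts)

drpo_iteration()
```

The only change relative to a standard RLHF loop is where each rollout starts: because the reset states come from trajectories already covered by the offline data, the collected rollouts concentrate on states a good policy visits, which is what underlies the paper's coverage-based guarantee.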