16 Apr 2024 | Jonathan D. Chang*, Wenhao Zhan*, Owen Oertell, Kianté Brantley, Dipendra Misra, Jason D. Lee, Wen Sun
This paper proposes a new algorithm called Dataset Reset Policy Optimization (DR-PO) for Reinforcement Learning from Human Feedback (RLHF). DR-PO integrates offline preference data into online policy training by resetting the policy optimizer to states in the offline dataset. This approach allows the policy to learn from informative states in the offline data, leading to better performance than existing methods such as PPO and DPO. Theoretical analysis shows that, under general function approximation and with finite sample complexity, DR-PO learns policies that are at least as good as any policy covered by the offline data. Empirical results on two standard RLHF datasets, TL;DR summarization and Anthropic HH, demonstrate that DR-PO outperforms PPO and DPO in terms of GPT-4 win rate. DR-PO is also computationally tractable and scales well across different model sizes. The key idea of dataset resets is shown to improve both theoretical guarantees and practical performance in RLHF.
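To make the dataset-reset idea more concrete, here is a minimal Python sketch of what a single rollout with a reset might look like. This is not the authors' implementation: the `policy.generate` and `reward_model.score` calls, the `reset_prob` parameter, and the dataset field names are hypothetical placeholders, and the paper's actual method wraps such rollouts in a PPO-style policy-optimization loop.

```python
import random


def sample_rollout_with_reset(policy, reward_model, offline_dataset, reset_prob=0.5):
    """Sketch of one rollout with a dataset reset (hypothetical interfaces).

    With probability `reset_prob`, generation is "reset" to an intermediate
    state taken from an offline preferred response: the policy continues from
    a random prefix of that response rather than from the prompt alone.
    """
    example = random.choice(offline_dataset)  # assumed fields: "prompt", "chosen"
    prompt, chosen = example["prompt"], example["chosen"]

    if random.random() < reset_prob:
        # Reset: cut the offline (preferred) response at a random point and
        # treat prompt + prefix as the starting state for online generation.
        cut = random.randrange(len(chosen) + 1)
        start_state = prompt + chosen[:cut]
    else:
        # Ordinary online rollout starting from the prompt.
        start_state = prompt

    completion = policy.generate(start_state)  # hypothetical generate() API
    full_response = start_state[len(prompt):] + completion
    reward = reward_model.score(prompt, full_response)  # hypothetical score() API
    return start_state, completion, reward
```

The point of the reset branch is that the policy regularly starts from informative intermediate states already reachable by the offline data, which is what underlies the coverage-based guarantee described above.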