Reinforcement Learning from Human Feedback with Active Queries


Feb 2024 | Kaixuan Ji, Jiafan He, and Quanquan Gu
This paper proposes a query-efficient reinforcement learning from human feedback (RLHF) method, named Active Proximal Policy Optimization (APPO), and its practical version, Active Direct Preference Optimization (ADPO). The key idea is to leverage active learning principles to reduce the number of human preference queries while maintaining performance. The problem is formalized as a contextual dueling bandit, where the goal is to minimize both regret and query complexity. APPO attains a regret bound of $ \widetilde{O}(d^{2}/\Delta) $ and a query complexity of $ \widetilde{O}(d^{2}/\Delta^{2}) $, where $ d $ is the feature dimension and $ \Delta $ is the sub-optimality gap. ADPO, based on direct preference optimization (DPO), achieves similar performance with only half the number of queries. Experiments show that ADPO outperforms DPO on the Open-LLM-Benchmark, achieving a 0.35% margin improvement. Theoretical analysis confirms the effectiveness of the proposed methods in reducing query complexity while maintaining performance. The results demonstrate that ADPO is a promising approach for aligning large language models with human preferences efficiently.
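To make the active-query idea concrete, below is a minimal sketch of a DPO-style training step that only asks for a human label when the model is uncertain about a response pair, and otherwise pseudo-labels the pair with its own prediction. This is an illustrative assumption-based reading of the abstract, not the paper's exact algorithm: the uncertainty proxy here is the DPO implicit reward margin, and `tau`, `beta`, and `human_label_fn` are hypothetical names introduced for the example.

```python
# Illustrative sketch (not the paper's exact ADPO procedure): query humans
# only on uncertain pairs, pseudo-label the rest, then apply a DPO-style loss.
import torch
import torch.nn.functional as F


def implicit_rewards(policy_logps, ref_logps, beta=0.1):
    """DPO implicit reward: beta * log(pi_theta(y|x) / pi_ref(y|x))."""
    return beta * (policy_logps - ref_logps)


def active_dpo_loss(policy_logps_1, policy_logps_2,
                    ref_logps_1, ref_logps_2,
                    human_label_fn, tau=0.5, beta=0.1):
    """DPO-style loss that spends human queries only on uncertain pairs.

    policy_logps_* / ref_logps_*: per-example sequence log-probabilities of
    the two candidate responses under the current policy and the reference
    model. human_label_fn(indices) returns 1.0 where response 1 is preferred
    and 0.0 otherwise; it is called only on the uncertain subset.
    """
    r1 = implicit_rewards(policy_logps_1, ref_logps_1, beta)
    r2 = implicit_rewards(policy_logps_2, ref_logps_2, beta)
    margin = r1 - r2                      # model's confidence that 1 beats 2

    uncertain = margin.abs() < tau        # small margin -> ask a human
    labels = (margin > 0).float()         # pseudo-label the confident pairs
    if uncertain.any():
        idx = uncertain.nonzero(as_tuple=True)[0]
        labels[uncertain] = human_label_fn(idx)

    # Bradley-Terry / logistic preference loss with the mixed labels.
    signed_margin = torch.where(labels.bool(), margin, -margin)
    loss = -F.logsigmoid(signed_margin).mean()
    return loss, int(uncertain.sum())     # also report the query budget spent
```

Under this reading, the query savings come from the threshold: confident pairs never reach a human annotator, which is consistent with the abstract's claim that ADPO matches DPO while issuing roughly half as many preference queries.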