Reinforcement Learning from Human Feedback with Active Queries


Feb 2024 | Kaixuan Ji, Jiafan He, and Quanquan Gu
This paper proposes a query-efficient reinforcement learning from human feedback (RLHF) method, named Active Proximal Policy Optimization (APPO), and its practical version, Active Direct Preference Optimization (ADPO). The key idea is to leverage active learning principles to reduce the number of human preference queries while maintaining performance. The problem is formalized as a contextual dueling bandit, where the goal is to minimize both regret and query complexity. APPO attains a regret bound of $ \widetilde{O}(d^{2}/\Delta) $ and a query complexity of $ \widetilde{O}(d^{2}/\Delta^{2}) $, where $ d $ is the feature dimension and $ \Delta $ is the sub-optimality gap. ADPO, based on direct preference optimization (DPO), achieves similar performance with only half the number of queries. Experiments show that ADPO outperforms DPO on the Open-LLM-Benchmark, achieving a 0.35% margin improvement. Theoretical analysis confirms the effectiveness of the proposed methods in reducing query complexity while maintaining performance. The results demonstrate that ADPO is a promising approach for aligning large language models with human preferences efficiently.
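To make the active-query idea concrete, below is a minimal sketch of a DPO-style training step that only asks for a human label when the model is uncertain about a response pair, and otherwise pseudo-labels the pair with its own prediction. This is an illustrative assumption-based reading of the abstract, not the paper's exact algorithm: the uncertainty proxy here is the DPO implicit reward margin, and `tau`, `beta`, and `human_label_fn` are hypothetical names introduced for the example.

```python
# Illustrative sketch (not the paper's exact ADPO procedure): query humans
# only on uncertain pairs, pseudo-label the rest, then apply a DPO-style loss.
import torch
import torch.nn.functional as F


def implicit_rewards(policy_logps, ref_logps, beta=0.1):
    """DPO implicit reward: beta * log(pi_theta(y|x) / pi_ref(y|x))."""
    return beta * (policy_logps - ref_logps)


def active_dpo_loss(policy_logps_1, policy_logps_2,
                    ref_logps_1, ref_logps_2,
                    human_label_fn, tau=0.5, beta=0.1):
    """DPO-style loss that spends human queries only on uncertain pairs.

    policy_logps_* / ref_logps_*: per-example sequence log-probabilities of
    the two candidate responses under the current policy and the reference
    model. human_label_fn(indices) returns 1.0 where response 1 is preferred
    and 0.0 otherwise; it is called only on the uncertain subset.
    """
    r1 = implicit_rewards(policy_logps_1, ref_logps_1, beta)
    r2 = implicit_rewards(policy_logps_2, ref_logps_2, beta)
    margin = r1 - r2                      # model's confidence that 1 beats 2

    uncertain = margin.abs() < tau        # small margin -> ask a human
    labels = (margin > 0).float()         # pseudo-label the confident pairs
    if uncertain.any():
        idx = uncertain.nonzero(as_tuple=True)[0]
        labels[uncertain] = human_label_fn(idx)

    # Bradley-Terry / logistic preference loss with the mixed labels.
    signed_margin = torch.where(labels.bool(), margin, -margin)
    loss = -F.logsigmoid(signed_margin).mean()
    return loss, int(uncertain.sum())     # also report the query budget spent
```

Under this reading, the query savings come from the threshold: confident pairs never reach a human annotator, which is consistent with the abstract's claim that ADPO matches DPO while issuing roughly half as many preference queries.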