May 30, 2024 | Tengyang Xie*, Dylan J. Foster*, Akshay Krishnamurthy, Corby Rosset, Ahmed Awadallah, Alexander Rakhlin
Exploratory Preference Optimization (XPO) is a new algorithm for online exploration in Reinforcement Learning from Human Feedback (RLHF). XPO is a simple modification of Direct Preference Optimization (DPO) that augments the DPO objective with a principled exploration bonus, encouraging the model to generate diverse, informative responses and to explore beyond the support of the initial model and the human feedback data.

Theoretically, XPO is provably sample-efficient: under natural exploration conditions it converges to a near-optimal language model policy, regardless of whether the initial model has good coverage. Its design combines techniques from language modeling and theoretical reinforcement learning through the lens of KL-regularized Markov decision processes, using a KL-regularized regret decomposition and global optimism to address the central challenge of navigating the vast space of token sequences in search of responses that yield maximally informative feedback. These guarantees hold for any reinforcement learning problem with a stochastic starting state and deterministic transition dynamics.

Empirically, preliminary evaluations show that XPO is more sample-efficient than non-exploratory DPO variants, matching their performance with significantly less preference data. In sum, XPO is practical, provable, and empirically efficient, offering the first practical and provably sample-efficient online exploration algorithm for RLHF with general function approximation.
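To make the "simple modification of DPO" concrete, below is a minimal, hypothetical sketch (PyTorch-style Python) of the general shape of such an objective: the standard DPO logistic loss on preference pairs, plus an α-weighted log-probability term computed on responses sampled online during training. The function names, tensor inputs, and the exact sign and sampling scheme of the bonus are illustrative assumptions, not the paper's precise specification; consult the paper for the exact XPO objective.

```python
# Illustrative sketch only: a DPO-style loss augmented with an
# alpha-weighted bonus on sampled responses. Names, signs, and the
# sampling scheme are assumptions; see the paper for the real objective.
import torch
import torch.nn.functional as F


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO logistic loss on a batch of preference pairs.

    Each argument is a tensor of summed token log-probabilities of the
    chosen / rejected responses under the current policy or the frozen
    reference policy.
    """
    # Implicit rewards under the DPO reparameterization.
    chosen_rewards = beta * (logp_chosen - ref_logp_chosen)
    rejected_rewards = beta * (logp_rejected - ref_logp_rejected)
    # Negative log-sigmoid of the implicit reward margin.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


def xpo_style_objective(logp_chosen, logp_rejected,
                        ref_logp_chosen, ref_logp_rejected,
                        logp_sampled, alpha=0.01, beta=0.1):
    """DPO loss plus an alpha-weighted term on freshly sampled responses.

    `logp_sampled` holds the policy's log-probabilities of responses drawn
    online during training; the alpha term plays the role of the
    exploration bonus that distinguishes this objective from plain DPO.
    """
    base = dpo_loss(logp_chosen, logp_rejected,
                    ref_logp_chosen, ref_logp_rejected, beta=beta)
    bonus = alpha * logp_sampled.mean()
    return base + bonus


if __name__ == "__main__":
    # Dummy usage with random log-probabilities for a batch of 4 pairs.
    b = 4
    loss = xpo_style_objective(torch.randn(b), torch.randn(b),
                               torch.randn(b), torch.randn(b),
                               torch.randn(b))
    print(loss.item())
```

The point of the sketch is only that XPO reuses the DPO training pipeline essentially unchanged, with the exploration bonus entering as one extra term controlled by a single coefficient α.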