May 30, 2024 | Tengyang Xie*, Dylan J. Foster*, Akshay Krishnamurthy, Corby Rosset, Ahmed Awadallah, Alexander Rakhlin
Exploratory Preference Optimization (XPO) is a new algorithm for online exploration in Reinforcement Learning from Human Feedback (RLHF). XPO is a simple modification of Direct Preference Optimization (DPO) that augments the DPO objective with a principled exploration bonus, encouraging the model to generate diverse, informative responses and to explore beyond the support of the initial model and the human feedback data.

Theoretically, XPO is provably sample-efficient: under natural exploration conditions it converges to a near-optimal language model policy, regardless of whether the initial model has good coverage. Its design combines techniques from language modeling and theoretical reinforcement learning through the lens of KL-regularized Markov decision processes, using a KL-regularized regret decomposition and global optimism to address the central challenge of navigating the vast space of token sequences in search of responses that yield maximally informative feedback. These guarantees hold for any reinforcement learning problem with a stochastic starting state and deterministic transition dynamics.

Empirically, preliminary evaluations show that XPO is more sample-efficient than non-exploratory DPO variants, matching their performance with significantly less preference data. In sum, XPO is practical, provable, and empirically efficient, offering the first practical and provably sample-efficient online exploration algorithm for RLHF with general function approximation.
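To make the "simple modification of DPO" concrete, below is a minimal, hypothetical sketch (PyTorch-style Python) of the general shape of such an objective: the standard DPO logistic loss on preference pairs, plus an α-weighted log-probability term computed on responses sampled online during training. The function names, tensor inputs, and the exact sign and sampling scheme of the bonus are illustrative assumptions, not the paper's precise specification; consult the paper for the exact XPO objective.

```python
# Illustrative sketch only: a DPO-style loss augmented with an
# alpha-weighted bonus on sampled responses. Names, signs, and the
# sampling scheme are assumptions; see the paper for the real objective.
import torch
import torch.nn.functional as F


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO logistic loss on a batch of preference pairs.

    Each argument is a tensor of summed token log-probabilities of the
    chosen / rejected responses under the current policy or the frozen
    reference policy.
    """
    # Implicit rewards under the DPO reparameterization.
    chosen_rewards = beta * (logp_chosen - ref_logp_chosen)
    rejected_rewards = beta * (logp_rejected - ref_logp_rejected)
    # Negative log-sigmoid of the implicit reward margin.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


def xpo_style_objective(logp_chosen, logp_rejected,
                        ref_logp_chosen, ref_logp_rejected,
                        logp_sampled, alpha=0.01, beta=0.1):
    """DPO loss plus an alpha-weighted term on freshly sampled responses.

    `logp_sampled` holds the policy's log-probabilities of responses drawn
    online during training; the alpha term plays the role of the
    exploration bonus that distinguishes this objective from plain DPO.
    """
    base = dpo_loss(logp_chosen, logp_rejected,
                    ref_logp_chosen, ref_logp_rejected, beta=beta)
    bonus = alpha * logp_sampled.mean()
    return base + bonus


if __name__ == "__main__":
    # Dummy usage with random log-probabilities for a batch of 4 pairs.
    b = 4
    loss = xpo_style_objective(torch.randn(b), torch.randn(b),
                               torch.randn(b), torch.randn(b),
                               torch.randn(b))
    print(loss.item())
```

The point of the sketch is only that XPO reuses the DPO training pipeline essentially unchanged, with the exploration bonus entering as one extra term controlled by a single coefficient α.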