Active Preference Optimization for Sample Efficient RLHF


5 Jun 2024 | Nirjhar Das, Souradip Chakraborty, Aldo Pacchiano, Sayak Ray Chowdhury
Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning Large Language Models (LLMs) with human preferences. However, the reliance on high-quality human preference data creates a significant bottleneck in practical applications. Current methods often uniformly sample prompt-generation pairs, leading to sub-optimal alignment under limited budgets. Recent works have attempted to address this issue by designing heuristics based on generation uncertainty, but these methods either rely on restrictive assumptions or lack rigorous theoretical guarantees.

To tackle these challenges, the authors reformulate RLHF within the contextual preference bandit framework, treating prompts as contexts. They develop an active-learning algorithm called *Active Preference Optimization* (APO), which improves model alignment by querying preference feedback on the most informative samples. APO is analyzed under the Bradley-Terry-Luce (BTL) preference model, showing that its sub-optimality gap scales as \(O(1/\sqrt{T})\) for a sample budget of \(T\), whereas uniformly random sampling of prompts can incur a constant sub-optimality gap. Experimental evaluations on practical preference datasets validate APO's efficacy over existing methods, establishing it as a sample-efficient and cost-effective approach to alignment. The algorithm is shown to outperform uniform sampling in both the reward-learning and alignment steps, with notable performance gains on datasets such as IMDb sentiment and Anthropic-HH.
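To make the uncertainty-driven selection idea concrete, below is a minimal sketch of active preference querying under a linear BTL model. This is not the authors' exact APO procedure; the feature differences, the candidate pool, and the simulated preference oracle (`theta_star`) are illustrative assumptions. The sketch picks, at each step, the candidate comparison whose feature difference has the largest uncertainty under the current design matrix, queries a (simulated) BTL preference, and then fits a regularized maximum-likelihood estimate of the reward parameters.

```python
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def btl_mle(diffs, labels, lam=1.0, steps=200, lr=0.5):
    """Regularized maximum-likelihood estimate of theta under the BTL model.

    diffs:  (n, d) array of feature differences phi(x, a) - phi(x, a').
    labels: (n,) array with 1 if the first generation was preferred, else 0.
    """
    theta = np.zeros(diffs.shape[1])
    for _ in range(steps):
        p = sigmoid(diffs @ theta)
        grad = diffs.T @ (p - labels) + lam * theta  # gradient of the regularized negative log-likelihood
        theta -= lr * grad / max(len(diffs), 1)
    return theta


def select_most_uncertain(pool, V_inv):
    """Pick the candidate whose feature difference z maximizes z^T V^{-1} z,
    i.e. the direction the current design matrix knows least about."""
    scores = [z @ V_inv @ z for z in pool]
    return int(np.argmax(scores))


# --- toy run: d-dimensional features, a pool of candidate prompt/generation pairs ---
d, pool_size, budget, lam = 8, 200, 50, 1.0
theta_star = rng.normal(size=d)            # hypothetical "true" reward parameters (oracle)
pool = rng.normal(size=(pool_size, d))     # hypothetical feature differences for candidate pairs

V = lam * np.eye(d)                        # regularized design matrix
diffs, labels = [], []

for t in range(budget):
    V_inv = np.linalg.inv(V)
    i = select_most_uncertain(pool, V_inv)             # active choice instead of uniform sampling
    z = pool[i]
    y = rng.binomial(1, sigmoid(theta_star @ z))       # simulated BTL preference feedback
    diffs.append(z)
    labels.append(y)
    V += np.outer(z, z)                                # update design matrix with the queried pair

theta_hat = btl_mle(np.array(diffs), np.array(labels), lam=lam)
cos = theta_hat @ theta_star / (np.linalg.norm(theta_hat) * np.linalg.norm(theta_star) + 1e-12)
print(f"cosine similarity between estimated and true reward parameters: {cos:.3f}")
```

Replacing `select_most_uncertain` with a uniform random draw from the pool gives the baseline the paper compares against; the uncertainty-based rule spends the fixed query budget on comparisons the current estimate is least sure about, which is the intuition behind the improved \(O(1/\sqrt{T})\) guarantee.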