Active Preference Optimization for Sample Efficient RLHF


5 Jun 2024 | Nirjhar Das, Souradip Chakraborty, Aldo Pacchiano, Sayak Ray Chowdhury
Reinforcement Learning from Human Feedback (RLHF) is pivotal in aligning Large Language Models (LLMs) with human preferences. However, the reliance on high-quality human preference data creates a costly bottleneck in practical applications. Current methods uniformly select prompt-generation pairs from a dataset, leading to sub-optimal alignment under constrained budgets. This paper proposes Active Preference Optimization (APO), an active learning algorithm that enhances model alignment by querying preference feedback on the most informative samples, achieving superior performance under a small sample budget.

APO is formulated within the contextual preference bandit framework, treating prompts as contexts. Theoretical analysis shows that the suboptimality gap of the policy learned via APO scales as O(1/√T) for a sample budget of T, whereas uniform sampling is shown to suffer a constant suboptimality gap, highlighting the need for adaptive data-collection strategies. By actively selecting both contexts and actions, APO reduces the suboptimality gap faster than uniform sampling and other baselines.

Experimental evaluations on practical preference datasets, covering sentiment generation and dialogue tasks, validate APO's efficacy over existing methods, showing significant improvements in reward learning and alignment. The work thus contributes a sample-efficient and practical solution for preference data collection in RLHF.
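To make the active-selection idea concrete, below is a minimal sketch of uncertainty-driven preference querying under a Bradley–Terry style logistic reward model: from a pool of candidate (prompt, response pair) triples, it queries the pair whose reward-difference estimate the current model is least certain about, measured by the feature-difference norm under the inverse design matrix. This is an illustrative sketch, not the paper's exact algorithm; the names `feature_map`, `human_label`, `candidates`, and `fit_logistic` are assumptions introduced here for the example.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def active_preference_collection(candidates, feature_map, human_label, budget, dim, lam=1.0):
    """Toy active preference-collection loop (illustrative, not the paper's exact procedure).

    candidates : list of (prompt, response_a, response_b) triples available for querying
    feature_map: maps (prompt, response) -> np.ndarray of shape (dim,)
    human_label: oracle returning 1 if response_a is preferred over response_b, else 0
    budget     : number of preference queries allowed (T)
    """
    V = lam * np.eye(dim)   # regularized design matrix
    data = []               # collected (z, y) pairs, z = feature difference

    for _ in range(budget):
        V_inv = np.linalg.inv(V)

        # Score each candidate by the uncertainty of its feature difference
        # under the current design matrix: ||phi(x,a) - phi(x,b)||_{V^{-1}}^2.
        def uncertainty(c):
            x, a, b = c
            z = feature_map(x, a) - feature_map(x, b)
            return float(z @ V_inv @ z)

        x, a, b = max(candidates, key=uncertainty)
        z = feature_map(x, a) - feature_map(x, b)

        y = human_label(x, a, b)     # query the (simulated) annotator
        data.append((z, y))
        V += np.outer(z, z)          # update the design matrix with the queried pair

    # Fit a logistic (Bradley-Terry style) reward-difference model on the collected data.
    return fit_logistic(data, dim, lam)

def fit_logistic(data, dim, lam, lr=0.1, steps=500):
    """Regularized logistic regression on preference differences via gradient ascent."""
    theta = np.zeros(dim)
    for _ in range(steps):
        grad = -lam * theta
        for z, y in data:
            grad += (y - sigmoid(theta @ z)) * z
        theta += lr * grad / max(len(data), 1)
    return theta
```

The key design choice the sketch illustrates is that the query budget is spent where the reward model's uncertainty is highest, rather than uniformly across the dataset, which is the behavior the abstract attributes to APO.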