8 Mar 2024 | Huiying Zhong*, Zhun Deng†, Weijie J. Su‡, Zhiwei Steven Wu§, Linjun Zhang¶
This paper introduces a theoretical framework for multi-party reinforcement learning from human feedback (RLHF), addressing the challenge of aligning models with diverse and conflicting human preferences. Traditional RLHF methods often fail to capture and balance the preferences of multiple individuals, leading to suboptimal policies. To overcome this, the authors incorporate meta-learning to learn multiple reward functions and use social welfare functions to aggregate the resulting preferences. Focusing on offline learning settings, they establish sample complexity bounds together with efficiency and fairness guarantees for optimizing various social welfare functions, such as the Nash, Utilitarian, and Leximin welfare functions. The paper shows that multi-party RLHF is statistically more demanding than single-party RLHF, and extends the analysis to a reward-free setting, providing pessimistic variants of the von Neumann Winner based on offline preference data. The main contributions include a general framework for multi-party alignment, sample complexity bounds, efficiency and fairness guarantees, and extensions to reward-free models. The work demonstrates the advantages of multi-party RLHF while also highlighting its increased computational demands.
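The three welfare criteria named above aggregate per-party rewards in different ways: Utilitarian welfare sums expected rewards, Nash welfare takes their product (so a near-zero reward for any party collapses the objective), and Leximin compares reward vectors by the worst-off party first. The following is a minimal illustrative sketch, not the authors' code, using hypothetical per-party reward values to show how these aggregations can rank two candidate policies differently.

import numpy as np

def utilitarian_welfare(rewards):
    # Sum of expected rewards across parties.
    return float(np.sum(rewards))

def nash_welfare(rewards):
    # Product of (non-negative) expected rewards; in practice often
    # maximized via the sum of logarithms for numerical stability.
    return float(np.prod(rewards))

def leximin_key(rewards):
    # Leximin compares sorted reward vectors lexicographically,
    # prioritizing the worst-off party first.
    return tuple(sorted(rewards))

# Two hypothetical policies evaluated for three parties (illustrative numbers only).
policy_rewards = {
    "pi_1": np.array([0.9, 0.5, 0.2]),  # high total, but one party does poorly
    "pi_2": np.array([0.6, 0.6, 0.5]),  # more balanced across parties
}

for name, r in policy_rewards.items():
    print(name,
          "utilitarian:", utilitarian_welfare(r),
          "nash:", round(nash_welfare(r), 3),
          "leximin key:", leximin_key(r))

Here pi_1 wins under the Utilitarian criterion (larger sum), while pi_2 wins under both Nash (larger product) and Leximin (better worst-off party), illustrating why the choice of social welfare function matters for multi-party alignment.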