RLHF from Heterogeneous Feedback via Personalization and Preference Aggregation


April 30, 2024 | Chanwoo Park, Mingyang Liu, Dingwen Kong, Kaiqing Zhang, Asuman Ozdaglar
This paper addresses the challenges of Reinforcement Learning from Human Feedback (RLHF) when human preferences are heterogeneous and may be strategically manipulated. The authors propose two frameworks for handling heterogeneous human feedback: a personalization-based framework and a preference-aggregation-based framework. In the personalization-based framework, they use representation learning and clustering to learn multiple reward models that balance bias and variance, and they establish sample complexity guarantees for both approaches. In the preference-aggregation-based framework, they retain the standard single-model pipeline by aggregating diverse and truthful preferences. Here they propose two approaches, based on reward aggregation and preference aggregation; the latter directly aggregates human feedback in the form of probabilistic opinions. Under this probabilistic-opinion feedback model, they develop an approach to handle strategic human labelers who may bias and manipulate the aggregated preferences. Drawing on ideas from mechanism design, their approach ensures truthful preference reporting, with the induced aggregation rule maximizing social welfare functions. The paper also discusses related work, including previous studies on RLHF and representation learning, and provides theoretical analyses showing that the proposed methods achieve near-optimal sample complexity.
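To make the preference-aggregation idea concrete, below is a minimal sketch (not the paper's exact algorithm) of pooling probabilistic opinions from several labelers into a single preference distribution over two candidate responses. The function name, the per-labeler weights, and the choice between weighted geometric and arithmetic pooling, which loosely correspond to Nash-type and utilitarian-type social welfare objectives, are illustrative assumptions rather than the authors' formulation.

```python
import numpy as np

def aggregate_opinions(opinions, weights=None, rule="geometric"):
    """Pool per-labeler probabilistic opinions over two responses (a, b).

    opinions: array of shape (n_labelers, 2); each row is a reported
        probability distribution (p, 1 - p) over the two responses.
    weights: optional per-labeler weights summing to 1 (uniform if None).
    rule: "geometric" (log-linear pooling) or "arithmetic" (linear pooling).
    """
    opinions = np.asarray(opinions, dtype=float)
    n = opinions.shape[0]
    w = np.full(n, 1.0 / n) if weights is None else np.asarray(weights, dtype=float)

    if rule == "geometric":
        # Weighted geometric mean of the reported probabilities, renormalized.
        log_pool = (w[:, None] * np.log(np.clip(opinions, 1e-12, 1.0))).sum(axis=0)
        pooled = np.exp(log_pool)
    elif rule == "arithmetic":
        # Weighted arithmetic mean of the reported probabilities.
        pooled = (w[:, None] * opinions).sum(axis=0)
    else:
        raise ValueError(f"unknown rule: {rule}")

    return pooled / pooled.sum()


# Example: three labelers with heterogeneous opinions about responses (a, b).
reported = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
print(aggregate_opinions(reported, rule="geometric"))
print(aggregate_opinions(reported, rule="arithmetic"))
```

The aggregated distribution can then stand in for a single labeler's preference in a standard RLHF reward-learning step; the mechanism-design component of the paper concerns incentivizing labelers to report such opinions truthfully, which this sketch does not model.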