RLHF from Heterogeneous Feedback via Personalization and Preference Aggregation


April 30, 2024 | Chanwoo Park, Mingyang Liu, Dingwen Kong, Kaiqing Zhang, Asuman Ozdaglar
This paper addresses the challenges of Reinforcement Learning from Human Feedback (RLHF) when human preferences are heterogeneous and may be strategically manipulated. The authors propose two frameworks for handling heterogeneous human feedback: a personalization-based framework and a preference-aggregation-based framework. In the personalization-based framework, they use representation learning and clustering to learn multiple reward models that balance bias and variance, and they establish sample complexity guarantees for both approaches. In the preference-aggregation-based framework, they retain the standard single-model pipeline by aggregating diverse and truthful preferences. Here they propose two approaches, based on reward aggregation and preference aggregation; the latter directly aggregates human feedback in the form of probabilistic opinions. Under this probabilistic-opinion feedback model, they develop an approach to handle strategic human labelers who may bias and manipulate the aggregated preferences. Drawing on ideas from mechanism design, their approach ensures truthful preference reporting, with the induced aggregation rule maximizing social welfare functions. The paper also discusses related work, including previous studies on RLHF and representation learning, and provides theoretical analyses showing that the proposed methods achieve near-optimal sample complexity.
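To make the preference-aggregation idea concrete, below is a minimal sketch (not the paper's exact algorithm) of pooling probabilistic opinions from several labelers into a single preference distribution over two candidate responses. The function name, the per-labeler weights, and the choice between weighted geometric and arithmetic pooling, which loosely correspond to Nash-type and utilitarian-type social welfare objectives, are illustrative assumptions rather than the authors' formulation.

```python
import numpy as np

def aggregate_opinions(opinions, weights=None, rule="geometric"):
    """Pool per-labeler probabilistic opinions over two responses (a, b).

    opinions: array of shape (n_labelers, 2); each row is a reported
        probability distribution (p, 1 - p) over the two responses.
    weights: optional per-labeler weights summing to 1 (uniform if None).
    rule: "geometric" (log-linear pooling) or "arithmetic" (linear pooling).
    """
    opinions = np.asarray(opinions, dtype=float)
    n = opinions.shape[0]
    w = np.full(n, 1.0 / n) if weights is None else np.asarray(weights, dtype=float)

    if rule == "geometric":
        # Weighted geometric mean of the reported probabilities, renormalized.
        log_pool = (w[:, None] * np.log(np.clip(opinions, 1e-12, 1.0))).sum(axis=0)
        pooled = np.exp(log_pool)
    elif rule == "arithmetic":
        # Weighted arithmetic mean of the reported probabilities.
        pooled = (w[:, None] * opinions).sum(axis=0)
    else:
        raise ValueError(f"unknown rule: {rule}")

    return pooled / pooled.sum()


# Example: three labelers with heterogeneous opinions about responses (a, b).
reported = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
print(aggregate_opinions(reported, rule="geometric"))
print(aggregate_opinions(reported, rule="arithmetic"))
```

The aggregated distribution can then stand in for a single labeler's preference in a standard RLHF reward-learning step; the mechanism-design component of the paper concerns incentivizing labelers to report such opinions truthfully, which this sketch does not model.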