The paper "MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences" addresses the challenge of aligning large language models (LLMs) with diverse human preferences using Reinforcement Learning from Human Feedback (RLHF). The authors highlight the limitations of traditional RLHF approaches, which often use a single reward model derived from preference data, leading to an inability to represent the rich diversity of human preferences. They derive an *impossibility result* showing that a single reward model cannot effectively align with diverse human preferences. To address this, they propose a novel approach called MaxMin-RLHF, which learns a mixture of preference distributions using an expectation-maximization (EM) algorithm and introduces a MaxMin alignment objective inspired by the Egalitarian principle in social choice theory. This objective aims to maximize social utility while ensuring fairness across different user groups. The paper provides theoretical justifications and empirical evidence to support the effectiveness of MaxMin-RLHF, demonstrating significant improvements in win-rates and accuracy for minority groups compared to conventional RLHF algorithms. The approach is validated on both small-scale (GPT-2) and large-scale (Tulu2-7B) language models, showing robustness and fairness in handling diverse human preferences.
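
To make the MaxMin objective concrete, the sketch below writes it as a min over per-group KL-regularized RLHF utilities, which is the standard form such an objective takes; the notation (number of groups $H$, group reward models $r_h$, prompt distribution $\rho$, KL coefficient $\beta$, reference policy $\pi_{\mathrm{ref}}$) is generic RLHF notation assumed here rather than copied from the paper.

```latex
% Minimal sketch of a MaxMin alignment objective over H user groups,
% assuming the usual KL-regularized per-group utility (notation is generic, not verbatim from the paper).
\[
  \max_{\pi}\; \min_{h \in \{1,\dots,H\}}\;
  \mathbb{E}_{x \sim \rho}\!\left[
    \mathbb{E}_{y \sim \pi(\cdot \mid x)}\!\bigl[ r_h(x, y) \bigr]
    \;-\; \beta\, D_{\mathrm{KL}}\!\bigl( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \bigr)
  \right]
\]
```

Here each $r_h$ would be the reward model associated with user group $h$ (the groups being recovered from the EM-estimated mixture of preference distributions), so maximizing the minimum group utility enforces the Egalitarian criterion: no group's alignment quality is sacrificed to raise the average.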