MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences


14 Feb 2024 | Souradip Chakraborty, Jiahao Qiu, Hui Yuan, Alec Koppel, Furong Huang, Dinesh Manocha, Amrit Singh Bedi, and Mengdi Wang
This paper presents MaxMin-RLHF, an approach for aligning large language models (LLMs) with diverse human preferences. The authors argue that conventional Reinforcement Learning from Human Feedback (RLHF), which fits a single reward model, cannot capture the heterogeneity of human preferences. They prove an impossibility result for single-reward RLHF: when human sub-populations disagree, no policy aligned to a single reward can serve all of them, and the resulting alignment gap grows with the diversity of preferences.

To address this, they propose a MaxMin alignment objective inspired by the Egalitarian principle in social choice theory: rather than maximizing an average reward, the policy maximizes the minimum utility across all user groups.
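For concreteness, the Egalitarian objective described above can be written as follows. This is a schematic rendering, assuming the usual KL-regularized RLHF setup with $H$ user groups, per-group reward models $r_h$, prompt distribution $\rho$, and reference policy $\pi_{\mathrm{ref}}$; the paper's exact formulation of the regularization may differ.

$$
\max_{\pi} \;\; \min_{h \in \{1,\dots,H\}} \;\; \mathbb{E}_{x \sim \rho,\; y \sim \pi(\cdot\mid x)}\!\left[ r_h(x, y) \;-\; \beta \log \frac{\pi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \right]
$$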
To optimize this objective, the authors introduce the MaxMin-RLHF algorithm, which first learns a mixture of preference distributions, one reward model per latent user group, via an Expectation-Maximization (EM) procedure, and then trains the policy against the worst-case group reward (a sketch of the EM step appears at the end of this summary). Modeling the mixture allows the method to represent the preferences of every sub-population rather than only the majority.

Experiments on a small-scale model (GPT-2) and a large-scale model (Tulu2-7B) show that MaxMin-RLHF improves alignment performance, with an average win-rate gain of over 16% compared to conventional RLHF algorithms, and improves win-rates for minority groups by over 33% without compromising performance on majority groups, demonstrating its robustness and fairness. The authors also connect their approach to distributionally robust optimization and general-utility RL, emphasizing its generality, and argue that the findings extend beyond language models to reinforcement learning in general. The paper combines theoretical analysis with empirical evaluation to demonstrate equitable alignment of LLMs under diverse human preferences.
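The sketch below illustrates the EM step described above. It is a minimal, self-contained illustration assuming linear reward models r_h(x, y) = w_h · phi(x, y) over precomputed features and a Bradley-Terry likelihood for pairwise preferences; the feature representation, the number of groups, and the gradient-ascent M-step are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def em_reward_mixture(feat_chosen, feat_rejected, n_groups=2,
                      n_iters=50, lr=0.1, seed=0):
    """Fit a K-component mixture of linear Bradley-Terry reward models.

    feat_chosen, feat_rejected : (N, d) arrays of features phi(x, y) for the
        preferred and dispreferred response of each comparison.
    Returns per-group weight vectors W (K, d) and mixing proportions pi (K,).
    """
    rng = np.random.default_rng(seed)
    n, d = feat_chosen.shape
    W = 0.01 * rng.standard_normal((n_groups, d))   # reward parameters per group
    pi = np.full(n_groups, 1.0 / n_groups)          # mixing proportions

    diff = feat_chosen - feat_rejected              # phi(x, y_w) - phi(x, y_l)
    for _ in range(n_iters):
        # E-step: responsibility of each group for each comparison,
        # proportional to pi_h * sigma(r_h(x, y_w) - r_h(x, y_l)).
        lik = sigmoid(diff @ W.T)                   # (N, K)
        resp = pi * lik
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: update mixing proportions and take a gradient step on the
        # responsibility-weighted Bradley-Terry log-likelihood of each group.
        pi = resp.mean(axis=0)
        for h in range(n_groups):
            grad = (resp[:, h] * (1.0 - sigmoid(diff @ W[h]))) @ diff / n
            W[h] += lr * grad
    return W, pi

def maxmin_value(policy_features, W):
    """Egalitarian objective estimate: the minimum average reward across groups
    for a batch of responses sampled from the current policy (features only)."""
    group_returns = policy_features @ W.T           # (N, K) rewards per group
    return group_returns.mean(axis=0).min()
```

Once the mixture is fit, the policy-optimization stage would maximize the worst-case group reward (as in `maxmin_value`) rather than a single averaged reward, which is what distinguishes the MaxMin objective from standard RLHF.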
[slides and audio] MaxMin-RLHF: Alignment with Diverse Human Preferences