2024 | Luise Ge, Daniel Halpern, Evi Micha, Ariel D. Procaccia, Itai Shapira, Yevgeniy Vorobeychik, Junlin Wu
This paper investigates the alignment of AI models with human values through the lens of social choice theory, focusing on reinforcement learning from human feedback (RLHF). The authors argue that the standard approach to learning reward functions in RLHF, which relies on maximum likelihood estimation of a random utility model like the Bradley-Terry-Luce (BTL) model, fails to meet key axiomatic standards of fairness and efficiency. They propose a new framework for learning reward functions with strong axiomatic guarantees, based on linear social choice theory.
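To ground the critique, here is a minimal sketch of the reward-learning step being discussed: maximum-likelihood fitting of a BTL model with a linear reward r(x) = w · φ(x) on pairwise comparisons. The feature matrix, comparison data, and function names below are illustrative assumptions, not from the paper.

```python
import numpy as np

def btl_nll(w, features, comparisons):
    """Negative log-likelihood of pairwise comparisons under a linear
    Bradley-Terry-Luce model: P(a beats b) = sigmoid(r(a) - r(b)),
    with reward r(x) = w . phi(x).

    features:    (num_candidates, d) array of candidate features phi(x)
    comparisons: iterable of (winner_index, loser_index) pairs
    """
    rewards = features @ w
    nll = 0.0
    for winner, loser in comparisons:
        diff = rewards[winner] - rewards[loser]
        nll += np.logaddexp(0.0, -diff)  # -log sigmoid(diff), computed stably
    return nll

# Toy data: 3 candidates described by 2 features, 3 annotator comparisons.
phi = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
pairs = [(0, 1), (0, 2), (2, 1)]
print(btl_nll(np.array([0.3, -0.1]), phi, pairs))
```

Minimizing this loss over w (e.g., by gradient descent) is the maximum-likelihood reward-learning step whose axiomatic properties the paper examines.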
The paper introduces a linear social choice model in which candidates (e.g., responses to prompts) are ranked by linear reward functions over a feature space. The authors examine two key axioms: Pareto optimality (PO), which requires that if every voter prefers candidate a to candidate b, then the output ranking must also place a above b; and pairwise majority consistency (PMC), which requires that whenever the voters' pairwise majority preferences are themselves consistent (i.e., they form a ranking), the output ranking must agree with them.
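To make the two axioms concrete, here is a toy checker (illustrative only, not the paper's formal definitions): it tests whether a proposed output ranking violates PO, and computes the pairwise-majority ranking that PMC would force when one exists.

```python
from itertools import combinations, permutations

def position(ranking, x):
    return ranking.index(x)

def violates_po(output, voter_rankings):
    """Pareto optimality: if every voter ranks a above b, the output
    ranking must also place a above b."""
    for a, b in permutations(output, 2):
        unanimous = all(position(r, a) < position(r, b) for r in voter_rankings)
        if unanimous and position(output, a) > position(output, b):
            return True
    return False

def pairwise_majority_ranking(voter_rankings, candidates):
    """Return the ranking induced by strict pairwise majorities when it is
    consistent (a total order), else None. PMC requires the output ranking
    to equal this order whenever it exists."""
    def maj(a, b):
        wins = sum(position(r, a) < position(r, b) for r in voter_rankings)
        return wins > len(voter_rankings) / 2

    order = sorted(candidates,
                   key=lambda a: -sum(maj(a, b) for b in candidates if b != a))
    consistent = all(maj(a, b) for a, b in combinations(order, 2))
    return order if consistent else None

voters = [["a", "b", "c"], ["a", "c", "b"], ["b", "a", "c"]]
print(violates_po(["b", "a", "c"], voters))                # False
print(pairwise_majority_ranking(voters, ["a", "b", "c"]))  # ['a', 'b', 'c']
```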
The authors show that the standard loss-minimization methods used in RLHF, including maximum-likelihood estimation of the BTL model, fail these axioms. They propose a new linear rank aggregation rule called Leximax Copeland subject to PO (LCPO), which satisfies PO and PMC as well as two additional axioms: majority consistency and winner monotonicity. In other words, LCPO's output ranking agrees with consistent majority preferences, and a winning candidate remains the winner when voters move it up in their rankings.
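The full definition of LCPO is beyond this summary, but its Copeland ingredient, scoring each candidate by its number of strict pairwise-majority wins, can be sketched as follows. This is a plain Copeland-score computation; the leximax refinement and the restriction to Pareto-optimal rankings suggested by the rule's name are not implemented here.

```python
def copeland_scores(voter_rankings, candidates):
    """Copeland score of a candidate: the number of opponents it beats in a
    strict pairwise majority comparison. Ranking candidates by this score is
    the classical Copeland rule, which the paper's LCPO rule refines."""
    def beats(a, b):
        wins = sum(r.index(a) < r.index(b) for r in voter_rankings)
        return wins > len(voter_rankings) / 2

    return {a: sum(beats(a, b) for b in candidates if b != a) for a in candidates}

voters = [["a", "b", "c"], ["a", "c", "b"], ["b", "a", "c"]]
scores = copeland_scores(voters, ["a", "b", "c"])
print(sorted(scores, key=scores.get, reverse=True))  # ['a', 'b', 'c']
```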
The paper also discusses the implications of these findings for AI alignment and highlights the importance of incorporating axiomatic principles into the design of RLHF methods. The authors conclude that while widely used rules fail to meet basic axioms, there are alternative methods that offer stronger guarantees, providing a discriminative lens for evaluating RLHF and AI alignment approaches.