2024 | Luise Ge, Daniel Halpern, Evi Micha, Ariel D. Procaccia, Itai Shapira, Yevgeniy Vorobeychik, Junlin Wu
This paper investigates the alignment of AI models with human values through the lens of social choice theory, focusing on reinforcement learning from human feedback (RLHF). The authors argue that the standard approach to learning reward functions in RLHF, which relies on maximum likelihood estimation of a random utility model like the Bradley-Terry-Luce (BTL) model, fails to meet key axiomatic standards of fairness and efficiency. They propose a new framework for learning reward functions with strong axiomatic guarantees, based on linear social choice theory.
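To ground the critique, here is a minimal sketch of the reward-learning step being discussed: maximum-likelihood fitting of a BTL model with a linear reward r(x) = w · φ(x) on pairwise comparisons. The feature matrix, comparison data, and function names below are illustrative assumptions, not from the paper.

```python
import numpy as np

def btl_nll(w, features, comparisons):
    """Negative log-likelihood of pairwise comparisons under a linear
    Bradley-Terry-Luce model: P(a beats b) = sigmoid(r(a) - r(b)),
    with reward r(x) = w . phi(x).

    features:    (num_candidates, d) array of candidate features phi(x)
    comparisons: iterable of (winner_index, loser_index) pairs
    """
    rewards = features @ w
    nll = 0.0
    for winner, loser in comparisons:
        diff = rewards[winner] - rewards[loser]
        nll += np.logaddexp(0.0, -diff)  # -log sigmoid(diff), computed stably
    return nll

# Toy data: 3 candidates described by 2 features, 3 annotator comparisons.
phi = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
pairs = [(0, 1), (0, 2), (2, 1)]
print(btl_nll(np.array([0.3, -0.1]), phi, pairs))
```

Minimizing this loss over w (e.g., by gradient descent) is the maximum-likelihood reward-learning step whose axiomatic properties the paper examines.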
The paper introduces a linear social choice model in which candidates (e.g., responses to prompts) are ranked by linear reward functions over a feature space. The authors examine two key axioms: Pareto optimality (PO), which requires that if every voter prefers candidate a to candidate b, then the output ranking must also place a above b; and pairwise majority consistency (PMC), which requires that whenever the voters' pairwise majority preferences are themselves consistent (i.e., they form a ranking), the output ranking must agree with them.
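To make the two axioms concrete, here is a toy checker (illustrative only, not the paper's formal definitions): it tests whether a proposed output ranking violates PO, and computes the pairwise-majority ranking that PMC would force when one exists.

```python
from itertools import combinations, permutations

def position(ranking, x):
    return ranking.index(x)

def violates_po(output, voter_rankings):
    """Pareto optimality: if every voter ranks a above b, the output
    ranking must also place a above b."""
    for a, b in permutations(output, 2):
        unanimous = all(position(r, a) < position(r, b) for r in voter_rankings)
        if unanimous and position(output, a) > position(output, b):
            return True
    return False

def pairwise_majority_ranking(voter_rankings, candidates):
    """Return the ranking induced by strict pairwise majorities when it is
    consistent (a total order), else None. PMC requires the output ranking
    to equal this order whenever it exists."""
    def maj(a, b):
        wins = sum(position(r, a) < position(r, b) for r in voter_rankings)
        return wins > len(voter_rankings) / 2

    order = sorted(candidates,
                   key=lambda a: -sum(maj(a, b) for b in candidates if b != a))
    consistent = all(maj(a, b) for a, b in combinations(order, 2))
    return order if consistent else None

voters = [["a", "b", "c"], ["a", "c", "b"], ["b", "a", "c"]]
print(violates_po(["b", "a", "c"], voters))                # False
print(pairwise_majority_ranking(voters, ["a", "b", "c"]))  # ['a', 'b', 'c']
```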
The authors show that the standard loss-minimization methods used in RLHF, including maximum-likelihood estimation of the BTL model, fail these axioms. They propose a new linear rank aggregation rule called Leximax Copeland subject to PO (LCPO), which satisfies PO and PMC as well as two additional axioms: majority consistency and winner monotonicity. In other words, LCPO's output ranking agrees with consistent majority preferences, and a winning candidate remains the winner when voters move it up in their rankings.
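The full definition of LCPO is beyond this summary, but its Copeland ingredient, scoring each candidate by its number of strict pairwise-majority wins, can be sketched as follows. This is a plain Copeland-score computation; the leximax refinement and the restriction to Pareto-optimal rankings suggested by the rule's name are not implemented here.

```python
def copeland_scores(voter_rankings, candidates):
    """Copeland score of a candidate: the number of opponents it beats in a
    strict pairwise majority comparison. Ranking candidates by this score is
    the classical Copeland rule, which the paper's LCPO rule refines."""
    def beats(a, b):
        wins = sum(r.index(a) < r.index(b) for r in voter_rankings)
        return wins > len(voter_rankings) / 2

    return {a: sum(beats(a, b) for b in candidates if b != a) for a in candidates}

voters = [["a", "b", "c"], ["a", "c", "b"], ["b", "a", "c"]]
scores = copeland_scores(voters, ["a", "b", "c"])
print(sorted(scores, key=scores.get, reverse=True))  # ['a', 'b', 'c']
```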
The paper also discusses the implications of these findings for AI alignment and highlights the importance of incorporating axiomatic principles into the design of RLHF methods. The authors conclude that while widely used rules fail to meet basic axioms, there are alternative methods that offer stronger guarantees, providing a discriminative lens for evaluating RLHF and AI alignment approaches.