This paper explores the intersection between social choice theory (SCT) and reinforcement learning from human feedback (RLHF), analyzing how SCT can inform the design and evaluation of RLHF systems. RLHF aims to incorporate human preferences into AI models by learning a reward function from human feedback. This process shares similarities with social choice scenarios, where voters express preferences over alternatives, and a voting rule aggregates these preferences to determine an outcome. However, key differences exist between the two settings, such as the nature of the alternatives, the role of evaluators, and the goals of the aggregation process.
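To make the analogy concrete, the sketch below is a minimal illustration (not the paper's own method) of the aggregation step that plays the role of a voting rule in RLHF: it fits a Bradley-Terry-style reward model to pairwise human comparisons over a small finite set of alternatives using plain gradient ascent. The toy data and helper name are assumptions for illustration only.

```python
import numpy as np

def fit_bradley_terry(num_alternatives, comparisons, lr=0.1, epochs=500):
    """Fit scalar rewards r so that P(i preferred to j) = sigmoid(r[i] - r[j]).

    comparisons: list of (winner_index, loser_index) pairs from human feedback.
    Returns a reward vector, identified only up to an additive constant.
    """
    r = np.zeros(num_alternatives)
    for _ in range(epochs):
        grad = np.zeros_like(r)
        for w, l in comparisons:
            p = 1.0 / (1.0 + np.exp(-(r[w] - r[l])))  # model's P(w beats l)
            grad[w] += 1.0 - p   # push the winner's reward up
            grad[l] -= 1.0 - p   # push the loser's reward down
        r += lr * grad           # gradient ascent on the log-likelihood
    return r - r.mean()          # center for readability

# Toy "electorate": evaluators compare three candidate responses (0, 1, 2).
feedback = [(0, 1), (0, 2), (1, 2), (1, 0), (2, 1)]
print(fit_bradley_terry(3, feedback))
```

In practice the reward model is a neural network over prompts and responses rather than a lookup table of scores, but the preference-aggregation structure is the same.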
The paper identifies core differences between RLHF and SCT, including the effectively infinite space of alternatives in RLHF (any output the model could generate), the fact that each evaluator compares only a small subset of those alternatives, and the focus on scoring new alternatives rather than selecting a single winner from a fixed slate. It proposes SCT-style axioms for RLHF, such as unanimity, consistency, and Condorcet consistency, which can serve as evaluation criteria for preference modeling. These axioms are adapted to the RLHF setting, where the goal is to assign real-valued rewards to alternatives, including ones never seen during training.
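As a rough illustration of what such an axiom check could look like on a finite slate, the sketch below tests a Condorcet-style criterion: if some alternative beats every other in pairwise majority comparisons, the learned rewards should rank it strictly highest. The helper names and the exact formulation are assumptions for illustration; the paper's formal adaptation may differ in detail.

```python
import numpy as np

def condorcet_winner(pref_counts):
    """Return the index of the Condorcet winner, or None if there is none.

    pref_counts[i, j] = number of evaluators who prefer alternative i to j.
    """
    n = pref_counts.shape[0]
    for i in range(n):
        if all(pref_counts[i, j] > pref_counts[j, i] for j in range(n) if j != i):
            return i
    return None

def satisfies_condorcet_consistency(rewards, pref_counts):
    """Check whether the learned rewards give the Condorcet winner (when one
    exists) the strictly highest reward among the alternatives."""
    winner = condorcet_winner(pref_counts)
    if winner is None:
        return True  # the axiom only constrains profiles with a Condorcet winner
    return all(rewards[winner] > rewards[j]
               for j in range(len(rewards)) if j != winner)

# Example: 3 alternatives; pairwise majorities favour alternative 0.
counts = np.array([[0, 2, 3],
                   [1, 0, 2],
                   [0, 1, 0]])
print(satisfies_condorcet_consistency(np.array([1.2, 0.4, -0.1]), counts))
```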
The paper also discusses alternative perspectives for evaluating RLHF, including generalization, axiomatic approaches, and distortion. Generalization concerns how well the learned reward model extrapolates preferences to unseen alternatives, while distortion measures how far a rule's outcome can fall below the best achievable social welfare when the rule has access only to limited (e.g., ordinal) preference information. The paper argues that while some classic SCT axioms do not transfer directly to RLHF, they can still provide valuable guidance for the design of preference models.
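For reference, the standard utilitarian formalization of distortion from the social choice literature (the paper's RLHF adaptation may differ in detail) compares the best achievable social welfare with the welfare of the rule's chosen outcome, over all utility profiles consistent with the reported ordinal preferences:

```latex
\[
  \operatorname{dist}(f)
  \;=\;
  \max_{\sigma}\;
  \sup_{u \,\triangleright\, \sigma}\;
  \frac{\max_{a \in A} \sum_{i \in N} u_i(a)}
       {\sum_{i \in N} u_i\bigl(f(\sigma)\bigr)}
\]
```

Here $\sigma$ is a profile of reported ordinal preferences from the voters in $N$, $u \triangleright \sigma$ means the utility profile $u$ is consistent with $\sigma$, $A$ is the set of alternatives, and $f(\sigma)$ is the alternative selected by the rule; a distortion of 1 means the rule always selects a welfare-maximizing alternative.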
The analysis highlights the importance of considering the context in which RLHF is applied, as the desirability of certain axioms depends on the specific goals and constraints of the task. By drawing on SCT, the paper provides a framework for understanding and improving RLHF systems, particularly in scenarios where human preferences are diverse, noisy, or difficult to aggregate. The paper concludes that integrating insights from SCT can help address key challenges in RLHF, such as ensuring fairness, robustness, and alignment with human values.