Mapping Social Choice Theory to RLHF

19 Apr 2024 | Jessica Dai* and Eve Fleisig*
The paper "Mapping Social Choice Theory to RLHF" by Jessica Dai and Eve Fleisig explores the application of social choice theory (SCT) to the problem of incorporating human preferences into reinforcement learning from human feedback (RLHF). The authors highlight the similarities and differences between the RLHF and SCT settings, and discuss how these differences affect the interpretation of well-known technical results from SCT within the context of RLHF. Key points include:

1. **Problem settings**: RLHF involves learning a reward function from human preferences, while SCT deals with aggregating preferences from voters to select a winner.
2. **Core differences**: RLHF has an infinite space of alternatives and evaluators, whereas SCT assumes a finite set. Evaluators in RLHF may not be representative, and their preferences can be influenced by cognitive biases.
3. **Axiomatic approaches**: The paper proposes new axioms for the RLHF setting, such as $(a, \varepsilon)$-unanimity, $(a, \varepsilon)$-Condorcet consistency, and $(a, \varepsilon)$-consistency, which account for the context-specific nature of preferences (a Condorcet-style aggregation is sketched after this summary).
4. **Distortion**: The concept of distortion, which measures the worst-case suboptimality of a voting rule, is adapted to the RLHF setting to account for hidden biases and incomplete information (the classical definition is recalled below).
5. **Implications**: The paper discusses how these perspectives can help address open problems in RLHF, such as handling diverse evaluators, cognitive biases, and non-representative samples.

The authors argue that while some classic SCT axioms may not directly apply to RLHF, adapting them to the specific context can provide valuable insights and tools for analyzing and improving RLHF systems.
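To make the SCT side of the mapping concrete, the sketch below aggregates the same toy set of pairwise preferences in two ways: a Condorcet-style head-to-head tally, which the axioms in point 3 reason about, and a Bradley-Terry-style score fit, which is closer in spirit to how RLHF reward modeling turns pairwise labels into a scalar reward. This is an illustrative sketch of standard techniques, not the paper's construction; the data and function names are hypothetical.

```python
import numpy as np
from typing import Optional

# Toy pairwise preference counts over a finite set of alternatives:
# comparisons[i, j] = number of evaluators who preferred alternative i to j.
# (In RLHF the alternatives would be model outputs sampled for a prompt.)
comparisons = np.array([
    [0, 7, 6],
    [3, 0, 8],
    [4, 2, 0],
])


def condorcet_winner(comparisons: np.ndarray) -> Optional[int]:
    """Return the alternative that beats every other one head-to-head, if it exists.

    A Condorcet-consistent rule must select this alternative whenever it exists;
    the paper's (a, ε)-Condorcet consistency relaxes the requirement to hold
    per context and only approximately.
    """
    n = comparisons.shape[0]
    for i in range(n):
        if all(comparisons[i, j] > comparisons[j, i] for j in range(n) if j != i):
            return i
    return None


def bradley_terry_scores(comparisons: np.ndarray, iters: int = 200) -> np.ndarray:
    """Fit Bradley-Terry strengths with the standard minorization-maximization update.

    This mirrors, in miniature, how RLHF reward modeling converts pairwise labels
    into a single scalar score per alternative that is then maximized.
    """
    n = comparisons.shape[0]
    wins = comparisons.sum(axis=1)  # total wins of each alternative
    w = np.ones(n)
    for _ in range(iters):
        denom = np.array([
            sum((comparisons[i, j] + comparisons[j, i]) / (w[i] + w[j])
                for j in range(n) if j != i)
            for i in range(n)
        ])
        w = wins / denom
        w /= w.sum()  # normalize for readability
    return w


if __name__ == "__main__":
    print("Condorcet winner:", condorcet_winner(comparisons))       # alternative 0
    print("Bradley-Terry scores:", bradley_terry_scores(comparisons))
```

On distortion (point 4), the classical formulation from the voting literature is the worst-case ratio between the best achievable social welfare and the welfare of the alternative the rule selects, given that the rule $f$ only observes the ordinal preferences $\sigma(u)$ induced by underlying utilities $u$. The notation below follows that standard setting, not necessarily the paper's RLHF variant:

$$\mathrm{dist}(f) \;=\; \sup_{u} \; \frac{\max_{a \in A} \operatorname{SW}(a, u)}{\operatorname{SW}\bigl(f(\sigma(u)), u\bigr)}, \qquad \operatorname{SW}(a, u) \;=\; \sum_{i=1}^{n} u_i(a).$$

The RLHF adaptation discussed in the paper additionally has to account for evaluators' hidden biases and for the fact that only an incomplete sample of comparisons is ever observed.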