A Minimaximalist Approach to Reinforcement Learning from Human Feedback


2024 | Gokul Swamy, Christoph Dann, Rahul Kidambi, Zhiwei Steven Wu, Alekh Agarwal
This paper introduces Self-Play Preference Optimization (SPO), a novel approach to reinforcement learning from human feedback (RLHF). SPO is minimalist in that it requires neither a learned reward model nor adversarial training, making it simple to implement. It is maximalist in that it can handle non-Markovian, intransitive, and stochastic preferences, and is robust to the compounding errors that plague offline methods. SPO leverages the Minimax Winner (MW), a preference-aggregation concept from social choice theory, to frame learning from preferences as a zero-sum game between two policies. By exploiting the symmetry of this game, SPO trains a single agent against itself while retaining strong convergence guarantees.
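As a concrete illustration of this game-theoretic framing, the sketch below (a toy example, not the authors' implementation) computes the Minimax Winner of a small intransitive preference profile via multiplicative-weights self-play; the preference matrix `P`, step size, and iteration count are hypothetical choices. Because the preferences are cyclic, no single option beats every other, and the Minimax Winner is a mixture.

```python
import numpy as np

# P[i, j] = probability that option i is preferred to option j,
# for a rock-paper-scissors style (intransitive) preference profile.
P = np.array([
    [0.5, 1.0, 0.0],
    [0.0, 0.5, 1.0],
    [1.0, 0.0, 0.5],
])
A = 2.0 * P - 1.0                    # skew-symmetric payoff of the zero-sum game
n = len(A)

T = 10_000
eta = np.sqrt(2.0 * np.log(n) / T)   # standard multiplicative-weights step size
w = np.array([0.8, 0.1, 0.1])        # deliberately biased starting strategy
avg = np.zeros(n)

for _ in range(T):
    payoff = A @ w                   # expected payoff of each option against itself
    w = w * np.exp(eta * payoff)     # multiplicative-weights (self-play) update
    w /= w.sum()
    avg += w / T                     # time-averaged strategy

print(np.round(avg, 3))              # ~[1/3, 1/3, 1/3]: the Minimax Winner is mixed
```

In SPO, the same self-play idea operates over trajectories generated by the current policy rather than a small finite set of options.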
SPO works by sampling multiple trajectories from a policy, asking a preference or teacher model to compare them, and using the proportion of wins as the reward for each trajectory. This makes it more robust to intransitive, non-Markovian, and noisy preferences than prior methods.

Empirically, SPO is more sample-efficient than reward-model-based approaches on a suite of continuous control tasks and learns comparable policies under stochastic preferences without needing an extra model. It handles complex non-Markovian preferences, such as maximizing reward over the first three quarters of a trajectory while keeping the return in the last quarter below a threshold, despite searching over a class of Markovian policies. It also computes a Minimax Winner consistently across problem instances, even when preferences are intransitive because they aggregate sub-populations, and it outperforms reward-model-based approaches on such intransitive tasks.

Theoretically, SPO converges to an approximate Minimax Winner at the rate of the underlying no-regret algorithm, and when an underlying reward function does exist, it converges to the optimal policy at a fast rate.

The paper also provides a practical version of SPO that maintains a queue of trajectories and computes each trajectory's reward as its win rate against the queue, making SPO lightweight to implement on top of any policy optimization method. The paper concludes that SPO is a simple, effective, and theoretically sound approach to RLHF that can handle a wide range of preference structures.
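The practical recipe described above can be sketched in a few lines. The sketch below is one possible minimal implementation under assumptions, not the authors' code: a hypothetical preference oracle `prefers(traj_a, traj_b) -> bool` stands in for the preference or teacher model, and `queue_size` / `num_comparisons` are illustrative hyperparameters rather than values from the paper.

```python
from collections import deque
import random

class SPORewardLabeler:
    """Assigns each new trajectory a reward equal to its win rate
    against a queue of recently collected trajectories (a sketch)."""

    def __init__(self, prefers, queue_size=32, num_comparisons=8):
        self.prefers = prefers            # preference / teacher model (assumed given)
        self.queue = deque(maxlen=queue_size)
        self.num_comparisons = num_comparisons

    def reward(self, trajectory):
        if not self.queue:
            self.queue.append(trajectory)
            return 0.5                    # uninformative reward until the queue has entries
        opponents = random.sample(list(self.queue),
                                  k=min(self.num_comparisons, len(self.queue)))
        wins = sum(self.prefers(trajectory, opp) for opp in opponents)
        self.queue.append(trajectory)
        return wins / len(opponents)      # win rate serves as the scalar reward
```

The returned scalar can be fed to any policy-gradient or actor-critic learner in place of an environment reward, which is what makes this variant easy to layer on top of existing policy optimization code.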