2024 | Gokul Swamy, Christoph Dann, Rahul Kidambi, Zhiwei Steven Wu, Alekh Agarwal
The paper introduces Self-Play Preference Optimization (SPO), an algorithm for reinforcement learning from human feedback (RLHF) that requires neither a reward model nor adversarial training. SPO is designed to handle non-Markovian, intransitive, and stochastic preferences, and it remains robust to the compounding errors that plague offline approaches to sequential prediction. The core idea is to frame preference-based RL as a two-player zero-sum game in which the preference function acts as the payoff matrix. By exploiting the symmetry of this game, the authors show that a single agent playing against itself can learn the Minimax Winner (MW), a solution concept that generalizes the Copeland Winner and is robust to intransitive preferences.
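To make the game-theoretic picture concrete, here is a minimal tabular sketch, not the paper's deep-RL implementation: the preference matrix `P`, the Hedge learning rate, and the iteration budget are illustrative assumptions. A single exponential-weights learner plays the skew-symmetric game `P - 1/2` against itself, and its time-averaged strategy approximates the MW even when preferences are intransitive.

```python
import numpy as np

def minimax_winner_via_self_play(P, iters=20000, lr=0.1):
    """Approximate the Minimax Winner (MW) of a preference matrix by
    self-play with exponential weights (Hedge).

    P[i, j] is the probability that option i is preferred to option j,
    so P + P.T == 1 and the induced payoff matrix A = P - 1/2 is
    skew-symmetric (A == -A.T). That symmetry is what lets one agent
    play both sides of the zero-sum game.
    """
    n = P.shape[0]
    A = P - 0.5                        # zero-sum payoff matrix
    pi = np.full(n, 1.0 / n)           # current mixed strategy
    avg = np.zeros(n)                  # running average of strategies

    for t in range(1, iters + 1):
        avg += (pi - avg) / t          # incremental time-average
        payoff = A @ pi                # payoff of each option vs. the agent itself
        pi = pi * np.exp(lr * payoff)  # Hedge / multiplicative-weights update
        pi /= pi.sum()

    return avg                         # approximate Minimax Winner


# Intransitive (cyclic) preferences: 0 beats 1, 1 beats 2, 2 beats 0,
# so no option dominates in the Copeland sense; the MW is a mixed strategy.
P = np.array([[0.5, 0.8, 0.4],
              [0.2, 0.5, 0.9],
              [0.6, 0.1, 0.5]])
print(minimax_winner_via_self_play(P))  # roughly [0.50, 0.125, 0.375]
```

In the cyclic example, no single option beats all others, yet the time-averaged strategy settles on the mixed distribution the MW concept prescribes, which is exactly the robustness to intransitivity the summary refers to.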
SPO is evaluated on a suite of continuous control tasks, where it achieves better sample efficiency than reward-model-based approaches and stays robust to the intransitive, stochastic, and non-Markovian preferences that arise in practice. The paper provides theoretical guarantees, proving that SPO converges to an approximate MW at a rate comparable to standard no-regret algorithms. Empirical results confirm its effectiveness across several preference setups, including trajectory-level comparisons based on ground-truth Markovian rewards, stochastic preferences, and non-Markovian preferences.
The paper further discusses the limitations of reward-based RLHF algorithms and shows that, in settings where a clearly optimal policy exists, SPO converges to it at a fast statistical rate. SPO is implemented by combining no-regret algorithms with deep reinforcement learning techniques, making it versatile enough for a wide range of applications.
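The practical recipe behind that last sentence can be sketched as follows; this is a hypothetical illustration rather than the paper's exact implementation, and the batch win-rate scoring and the `prefer` oracle are assumptions for the example. Each trajectory sampled from the current policy is rewarded with the fraction of pairwise comparisons it wins against the rest of the batch, and those scalar rewards could then be passed to any standard deep-RL optimizer.

```python
import numpy as np

def self_play_rewards(trajectories, prefer):
    """Score each trajectory in a batch by its win-rate against the others
    under a (possibly stochastic, non-Markovian) preference oracle.
    `prefer(a, b)` is assumed to return 1 if trajectory `a` is preferred
    to trajectory `b`, and 0 otherwise."""
    n = len(trajectories)
    rewards = np.zeros(n)
    for i in range(n):
        wins = [prefer(trajectories[i], trajectories[j])
                for j in range(n) if j != i]
        rewards[i] = np.mean(wins)  # fraction of pairwise comparisons won
    return rewards                  # scalar rewards for a standard RL optimizer
```

Because each comparison is made between two complete trajectories, this kind of signal stays well defined even when preferences are non-Markovian or stochastic.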