Multi-turn Reinforcement Learning from Preference Human Feedback


23 May 2024 | Lior Shani, Aviv Rosenberg, Asaf Cassel, Oran Lang, Daniele Calandriello, Avital Zipori, Hila Noga, Orgad Keller, Bilal Piot, Idan Szpektor, Avinatan Hassidim, Yossi Matias, Rémi Munos
This paper addresses a limitation of existing Reinforcement Learning from Human Feedback (RLHF) methods, which focus primarily on single-turn interactions and are ill-suited to multi-turn tasks that require long-term planning. The authors propose Multi-turn Preference Optimization (MTPO), a novel approach that optimizes a policy from preference feedback comparing two full multi-turn conversations. MTPO is based on mirror descent and self-play, and it is proven to converge to a Nash equilibrium.

To evaluate the method, the paper introduces a new environment, Education Dialogue, in which a teacher agent guides a student agent in learning a topic. MTPO is compared against single-turn baselines and reward-based RLHF methods: it outperforms the single-turn baselines and matches reward-based RLHF even though it relies solely on preference feedback. The paper also shows that, when explicit rewards are available, MTPO recovers the same performance as reward-based RL. The authors release the data and prompts used to create the Education Dialogue environment, contributing to the advancement of multi-turn reinforcement learning.
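The summary describes MTPO only at a high level: two complete conversations are rolled out by self-play from the current policy, a preference between them is collected, and the policy is updated with a mirror-descent-style, regularized step toward the preferred conversation. The toy sketch below illustrates that loop on a tabular softmax policy. Everything in it (the two-action "conversation", the scripted preference oracle, the simple proximal term standing in for the KL/mirror-descent regularizer) is an assumption made for illustration, not the authors' implementation.

```python
"""Toy sketch of an MTPO-style self-play preference update.

A minimal illustration under assumed simplifications (tabular softmax
policy, two actions per turn, a scripted preference oracle); it is not
the paper's algorithm as applied to LLM policies.
"""
import numpy as np

rng = np.random.default_rng(0)

T, A = 4, 2                      # conversation turns, actions per turn
theta = np.zeros((T, A))         # current policy logits
ref_theta = np.zeros((T, A))     # fixed reference policy logits
eta, beta = 0.5, 0.05            # step size, regularization weight

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def rollout(logits):
    """Sample one full multi-turn conversation from the policy."""
    return [rng.choice(A, p=softmax(logits[t])) for t in range(T)]

def preference(traj_a, traj_b):
    """Scripted oracle: prefer the conversation with more action-1 turns."""
    return np.sign(sum(traj_a) - sum(traj_b))  # +1, -1, or 0

def grad_logp(logits, traj):
    """Gradient of the trajectory log-probability w.r.t. the logits."""
    g = np.zeros_like(logits)
    for t, a in enumerate(traj):
        g[t] -= softmax(logits[t])
        g[t, a] += 1.0
    return g

for step in range(500):
    # Self-play: both conversations come from the current policy.
    traj_a, traj_b = rollout(theta), rollout(theta)
    pref = preference(traj_a, traj_b)
    # Push probability mass toward the preferred conversation; an L2
    # proximal term in logit space stands in for the KL regularizer
    # that keeps the update close to the reference policy.
    g = pref * (grad_logp(theta, traj_a) - grad_logp(theta, traj_b))
    theta = theta + eta * (g - beta * (theta - ref_theta))

print("P(action 1) per turn:",
      [round(float(softmax(theta[t])[1]), 2) for t in range(T)])
```

Run for a few hundred steps, this drives the per-turn probability of the preferred action toward 1, which is the qualitative behavior one would expect from preference-only self-play; the actual MTPO update in the paper operates on LLM policies over full dialogue trajectories.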