Multi-turn Reinforcement Learning from Preference Human Feedback


23 May 2024 | Lior Shani, Aviv Rosenberg, Asaf Cassel, Oran Lang, Daniele Calandriello, Avital Zipori, Hila Noga, Orgad Keller, Bilal Piot, Idan Szpektor, Avinatan Hassidim, Yossi Matias, Rémi Munos
This paper addresses a limitation of existing Reinforcement Learning from Human Feedback (RLHF) methods, which focus primarily on single-turn interactions and are ill-suited to multi-turn tasks that require long-term planning. The authors propose Multi-turn Preference Optimization (MTPO), a novel approach that optimizes a policy from preference feedback comparing two full multi-turn conversations. MTPO is based on mirror descent and self-play, and it is proven to converge to a Nash equilibrium.

To evaluate the method, the paper introduces a new environment, Education Dialogue, in which a teacher agent guides a student agent in learning a topic. MTPO is compared against single-turn baselines and reward-based RLHF methods: it outperforms the single-turn baselines and matches reward-based RLHF even though it relies solely on preference feedback. The paper also shows that, when explicit rewards are available, MTPO recovers the same performance as reward-based RL. The authors release the data and prompts used to create the Education Dialogue environment, contributing to the advancement of multi-turn reinforcement learning.
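The summary describes MTPO only at a high level: two complete conversations are rolled out by self-play from the current policy, a preference between them is collected, and the policy is updated with a mirror-descent-style, regularized step toward the preferred conversation. The toy sketch below illustrates that loop on a tabular softmax policy. Everything in it (the two-action "conversation", the scripted preference oracle, the simple proximal term standing in for the KL/mirror-descent regularizer) is an assumption made for illustration, not the authors' implementation.

```python
"""Toy sketch of an MTPO-style self-play preference update.

A minimal illustration under assumed simplifications (tabular softmax
policy, two actions per turn, a scripted preference oracle); it is not
the paper's algorithm as applied to LLM policies.
"""
import numpy as np

rng = np.random.default_rng(0)

T, A = 4, 2                      # conversation turns, actions per turn
theta = np.zeros((T, A))         # current policy logits
ref_theta = np.zeros((T, A))     # fixed reference policy logits
eta, beta = 0.5, 0.05            # step size, regularization weight

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def rollout(logits):
    """Sample one full multi-turn conversation from the policy."""
    return [rng.choice(A, p=softmax(logits[t])) for t in range(T)]

def preference(traj_a, traj_b):
    """Scripted oracle: prefer the conversation with more action-1 turns."""
    return np.sign(sum(traj_a) - sum(traj_b))  # +1, -1, or 0

def grad_logp(logits, traj):
    """Gradient of the trajectory log-probability w.r.t. the logits."""
    g = np.zeros_like(logits)
    for t, a in enumerate(traj):
        g[t] -= softmax(logits[t])
        g[t, a] += 1.0
    return g

for step in range(500):
    # Self-play: both conversations come from the current policy.
    traj_a, traj_b = rollout(theta), rollout(theta)
    pref = preference(traj_a, traj_b)
    # Push probability mass toward the preferred conversation; an L2
    # proximal term in logit space stands in for the KL regularizer
    # that keeps the update close to the reference policy.
    g = pref * (grad_logp(theta, traj_a) - grad_logp(theta, traj_b))
    theta = theta + eta * (g - beta * (theta - ref_theta))

print("P(action 1) per turn:",
      [round(float(softmax(theta[t])[1]), 2) for t in range(T)])
```

Run for a few hundred steps, this drives the per-turn probability of the preferred action toward 1, which is the qualitative behavior one would expect from preference-only self-play; the actual MTPO update in the paper operates on LLM policies over full dialogue trajectories.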