Direct Multi-Turn Preference Optimization for Language Agents

17 Aug 2024 | Wentao Shi, Mengqi Yuan, Junkang Wu, Qifan Wang, Fuli Feng
This paper introduces DMPO (Direct Multi-Turn Preference Optimization), a novel loss function for language agents that directly optimizes the reinforcement learning (RL) objective in multi-turn scenarios. The key idea is twofold: replace the policy constraint in the RL objective with a state-action occupancy measure (SAOM) constraint, and introduce length normalization into the Bradley-Terry (BT) model. Together, these changes eliminate the partition function in the BT model, a major obstacle in multi-turn tasks, and allow the RL objective to be optimized more effectively. The SAOM constraint also helps mitigate compounding errors, a common issue in preference-based reinforcement learning, and the paper provides a theoretical explanation for why length normalization improves the DPO-style loss.

The DMPO loss is derived theoretically and validated through extensive experiments on three multi-turn agent task datasets. The results show that DMPO outperforms existing baselines in both effectiveness and robustness, with significant gains over other methods in the clean setting and particularly strong improvements in noisy environments. The paper also discusses the limitations of the current approach, including its focus on a turn-wise task formulation and the use of 7B-sized models. Overall, DMPO offers a promising solution for improving the performance of language agents on multi-turn tasks.
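To make the length-normalization idea concrete, the sketch below shows a simplified, DPO-style preference loss in which the summed policy-versus-reference log-probability ratios of the preferred and dispreferred multi-turn trajectories are divided by their lengths before entering the Bradley-Terry objective. This is an illustrative assumption, not the paper's exact DMPO derivation (which additionally relies on the SAOM constraint to remove the partition function); all function and variable names here are hypothetical.

```python
import torch
import torch.nn.functional as F


def length_normalized_preference_loss(
    policy_logps_w, policy_logps_l,  # summed log-probs of preferred/dispreferred trajectories under the policy
    ref_logps_w, ref_logps_l,        # same quantities under the frozen reference model
    len_w, len_l,                    # trajectory lengths (e.g., number of action tokens or turns)
    beta=0.1,                        # temperature of the Bradley-Terry objective (assumed value)
):
    """Illustrative DPO-style loss with length normalization for multi-turn trajectories."""
    # Length-normalized log-ratios between policy and reference model.
    ratio_w = (policy_logps_w - ref_logps_w) / len_w
    ratio_l = (policy_logps_l - ref_logps_l) / len_l

    # Bradley-Terry preference objective on the normalized margin.
    margin = beta * (ratio_w - ratio_l)
    return -F.logsigmoid(margin).mean()


# Toy usage with random summed log-probabilities for a batch of 4 preference pairs.
policy_w = torch.randn(4, requires_grad=True)
policy_l = torch.randn(4, requires_grad=True)
ref_w, ref_l = torch.randn(4), torch.randn(4)
loss = length_normalized_preference_loss(
    policy_w, policy_l, ref_w, ref_l,
    len_w=torch.tensor([12.0, 9.0, 15.0, 7.0]),
    len_l=torch.tensor([10.0, 11.0, 8.0, 14.0]),
)
loss.backward()  # gradients flow back to the policy log-probabilities
```

Dividing by trajectory length keeps the preference margin comparable across trajectories with very different numbers of turns, which is one intuition behind the normalization the paper analyzes.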