This paper addresses the challenge of adapting Large Language Models (LLMs) to agent tasks, particularly in multi-turn scenarios. Direct Preference Optimization (DPO) is a promising technique for this adaptation, but it struggles in multi-turn settings because the partition function cannot be canceled. To overcome this, the authors propose a novel loss function called Direct Multi-Turn Preference Optimization (DMPO). DMPO replaces the policy constraint in the Reinforcement Learning (RL) objective with a state-action occupancy measure (SAOM) constraint and introduces length normalization into the Bradley-Terry (BT) model. This makes the partition function independent of the current state and addresses length disparities between preferred and dis-preferred trajectories. Extensive experiments on three multi-turn agent task datasets demonstrate the effectiveness and superiority of DMPO over existing methods, showing its ability to mitigate compounding errors and improve performance in both noisy and clean settings. The paper also provides theoretical explanations for the efficacy of the length normalization technique and the advantages of the SAOM constraint.
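
To make the length-normalization idea concrete, below is a minimal sketch of how a length-normalized Bradley-Terry loss over whole trajectories could look in a DPO-style setup. The function name, the choice of beta, and the per-turn log-probability inputs are illustrative assumptions; this is not the paper's exact DMPO objective, which additionally replaces the reference-policy constraint with the SAOM constraint.

```python
import torch
import torch.nn.functional as F

def length_normalized_bt_loss(logp_w, logp_ref_w, logp_l, logp_ref_l, beta=0.1):
    """DPO-style Bradley-Terry loss over trajectories, with each trajectory's
    implicit reward normalized by its number of turns so that length
    disparities between preferred and dis-preferred trajectories do not
    dominate the preference signal."""
    # Implicit per-trajectory reward: sum of policy/reference log-ratios,
    # divided by the number of turns in that trajectory.
    r_w = beta * (logp_w - logp_ref_w).sum() / logp_w.numel()
    r_l = beta * (logp_l - logp_ref_l).sum() / logp_l.numel()
    # Negative log-likelihood that the preferred trajectory wins under the BT model.
    return -F.logsigmoid(r_w - r_l)

# Toy usage: per-turn log-probabilities for trajectories of different lengths.
logp_w = torch.randn(7, requires_grad=True)   # preferred trajectory, 7 turns
logp_l = torch.randn(3, requires_grad=True)   # dis-preferred trajectory, 3 turns
loss = length_normalized_bt_loss(logp_w, torch.randn(7), logp_l, torch.randn(3))
loss.backward()
print(loss.item())
```

Without the division by trajectory length, the trajectory with more turns accumulates more log-ratio terms and can win or lose the comparison largely on length alone; the normalization keeps the comparison on a per-turn scale.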