Watch Every Step! LLM Agent Learning via Iterative Step-Level Process Refinement


17 Jun 2024 | Weimin Xiong, Yifan Song, Xiutian Zhao, Wenhao Wu, Xun Wang, Ke Wang, Cheng Li, Wei Peng, Sujian Li
This paper introduces the Iterative step-level Process Refinement (IPR) framework for training large language model (LLM) agents to perform complex interactive tasks. IPR provides detailed step-by-step guidance during agent training by estimating step-level rewards with the Monte Carlo method. In each iteration, the agent explores along the expert trajectory and generates new actions. These actions are evaluated against the corresponding steps of the expert trajectory using the step-level rewards, which exposes discrepancies and yields contrastive action pairs for training. The framework also performs iterative agent optimization, combining outcome-level direct preference optimization (DPO), step-level DPO, and supervised fine-tuning (SFT) losses to strengthen the agent's action capabilities at each step.

Experiments on three complex agent tasks (WebShop, InterCodeSQL, and ALFWorld) demonstrate that IPR outperforms a variety of strong baselines, improving the agent's performance by 5.8%, 7.2%, and 3.2% on WebShop, InterCodeSQL, and ALFWorld, respectively. Further analysis highlights the framework's effectiveness in improving action efficiency and its applicability to diverse models.
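To make the training objective concrete, the sketch below shows, in PyTorch-style Python, how the three losses could be combined. It is a minimal illustration under stated assumptions: the function names, the beta value, and the equal loss weights are placeholders rather than the paper's actual settings; only the overall form (outcome-level DPO plus step-level DPO plus SFT) follows the description above.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a batch of preference pairs, given the summed
    token log-probabilities of each response under the policy being trained
    and under the frozen reference model."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

def combined_objective(outcome_pair, step_pair, expert_logp,
                       beta=0.1, w_outcome=1.0, w_step=1.0, w_sft=1.0):
    """Hypothetical weighted sum of the three losses described above:
    outcome-level DPO on full-trajectory preference pairs, step-level DPO on
    contrastive action pairs, and SFT (negative log-likelihood) on expert actions."""
    loss_outcome = dpo_loss(*outcome_pair, beta=beta)
    loss_step = dpo_loss(*step_pair, beta=beta)
    loss_sft = -expert_logp.mean()
    return w_outcome * loss_outcome + w_step * loss_step + w_sft * loss_sft
```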
The IPR framework addresses two key challenges in training LLM agents: the lack of step-level process supervision in existing environments, and the difficulty of effectively using step rewards to improve agent training, especially for tasks with long trajectories and complex action spaces. Its step-level reward acquisition method uses Monte Carlo sampling to estimate rewards, while its iterative agent optimization component refines the agent's actions through a cyclical process. By providing fine-grained guidance at each step, the framework enables the agent to take more accurate actions, leading to improved performance on complex interactive tasks.
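The step-level reward acquisition and contrastive pair construction might look roughly like the following sketch. The interfaces are assumptions: `env.rollout` (run the agent from a partial trajectory to completion and return the final task reward) and `policy.act` (propose the agent's next action) are hypothetical placeholders, as are the rollout count and margin.

```python
def estimate_step_reward(env, policy, prefix, num_rollouts=5):
    """Monte Carlo estimate of a step's reward: roll the current policy out to
    task completion several times from the given partial trajectory and average
    the final task rewards."""
    returns = [env.rollout(policy, start_from=prefix) for _ in range(num_rollouts)]
    return sum(returns) / len(returns)

def collect_contrastive_pairs(env, policy, expert_trajectory, margin=0.0):
    """Explore along the expert trajectory: at each step the agent proposes its
    own action, both continuations are scored with Monte Carlo rollouts, and a
    (preferred expert action, dispreferred agent action) pair is kept whenever
    the expert action scores higher."""
    pairs, prefix = [], []
    for expert_action in expert_trajectory:
        agent_action = policy.act(prefix)
        r_expert = estimate_step_reward(env, policy, prefix + [expert_action])
        r_agent = estimate_step_reward(env, policy, prefix + [agent_action])
        if r_expert - r_agent > margin:
            pairs.append((list(prefix), expert_action, agent_action))
        prefix.append(expert_action)  # keep following the expert trajectory
    return pairs
```

In each training cycle, pairs collected this way would feed the step-level DPO term of the combined objective sketched earlier, while full-trajectory comparisons supply the outcome-level term.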