Watch Every Step! LLM Agent Learning via Iterative Step-Level Process Refinement

17 Jun 2024 | Weimin Xiong, Yifan Song, Xiutian Zhao, Wenhao Wu, Xun Wang, Ke Wang, Cheng Li, Wei Peng, Sujian Li
The paper introduces the Iterative Step-level Process Refinement (IPR) framework, which enhances the training of large language model (LLM) agents by providing detailed step-by-step guidance. Unlike existing methods that rely solely on outcome rewards, IPR uses Monte Carlo (MC) estimation to assign step-level rewards. The agent explores and generates new actions at each step; these actions are then evaluated against the corresponding steps of the expert trajectory, and any discrepancies yield contrastive action pairs for training. The agent is then optimized with a combination of outcome-level direct preference optimization (DPO), step-level DPO, and supervised fine-tuning (SFT) losses. Experiments on three complex agent tasks (WebShop, InterCode-SQL, and ALFWorld) show that IPR outperforms various strong baselines, demonstrating its effectiveness in improving action efficiency and generalization. The paper also discusses the method's limitations and suggests directions for future work.
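To make the described procedure concrete, below is a minimal Python sketch of the two core pieces of the summary: MC estimation of step-level rewards with construction of contrastive action pairs, and the combined outcome-DPO, step-DPO, and SFT objective. All interfaces here (`env.clone_from`, `policy.sample_action`, `run_to_end`, the loss weighting) are hypothetical placeholders for illustration, not the authors' implementation.

```python
import torch.nn.functional as F

# Hypothetical interfaces: `env` can be cloned from a given state and rolled
# out to the end of an episode; `policy` can sample an action for a state.
# These names are assumptions, not the IPR paper's code.

def mc_step_reward(policy, env, state, action, num_rollouts=5):
    """Estimate the step-level reward of taking `action` at `state` via
    Monte Carlo: take the action, then let the current policy finish the
    episode several times and average the final outcome rewards."""
    returns = []
    for _ in range(num_rollouts):
        rollout = env.clone_from(state)        # assumed: env copy at this state
        rollout.step(action)
        returns.append(rollout.run_to_end(policy))  # assumed: outcome reward
    return sum(returns) / len(returns)

def build_contrastive_pairs(policy, env, expert_traj):
    """At each expert step, let the agent propose its own action; when the
    expert action scores higher under the MC estimate, keep the pair as
    (state, preferred=expert action, dispreferred=agent action)."""
    pairs = []
    for state, expert_action in expert_traj:
        agent_action = policy.sample_action(state)
        r_expert = mc_step_reward(policy, env, state, expert_action)
        r_agent = mc_step_reward(policy, env, state, agent_action)
        if r_expert > r_agent:                 # discrepancy with the expert
            pairs.append((state, expert_action, agent_action))
    return pairs

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss on a (winner, loser) pair of log-probabilities;
    applied at the outcome level to whole trajectories and at the step
    level to the contrastive action pairs."""
    logits = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(logits).mean()

def ipr_loss(outcome_terms, step_terms, sft_logp):
    """Combined objective: outcome-level DPO + step-level DPO + SFT on the
    expert data (the uniform weighting here is an assumption)."""
    return dpo_loss(*outcome_terms) + dpo_loss(*step_terms) - sft_logp.mean()
```

As the framework's name indicates, this runs iteratively: the agent is retrained on the collected pairs, then explores again with the updated policy to produce the next round of step-level supervision.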