Trial and Error: Exploration-Based Trajectory Optimization for LLM Agents

10 Jul 2024 | Yifan Song, Da Yin, Xiang Yue, Jie Huang, Sujian Li, Bill Yuchen Lin
This paper introduces Exploration-Based Trajectory Optimization (ETO), a method for improving the performance of large language model (LLM) agents. Unlike previous approaches that learn only from successful expert trajectories, ETO lets the agent learn from both its successful and failed exploration attempts. The method runs an iterative exploration-training loop: the agent first interacts with the environment to collect failure trajectories, which are paired with expert successes to form contrastive trajectory pairs; during the training phase, the agent then updates its policy with a contrastive objective such as DPO (Direct Preference Optimization). ETO is evaluated on three complex agent tasks, WebShop, ScienceWorld, and ALFWorld, where it delivers significant improvements over behavioral cloning (SFT) and other strong baselines. Across experiments, ETO consistently outperforms SFT by a large margin, with gains of up to 22% on the out-of-distribution test set of ScienceWorld. ETO also shows strong generalization and task-solving efficiency, even in scenarios without expert trajectories. The paper closes by discussing the method's limitations and directions for future work.
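To make the exploration-training loop concrete, here is a minimal sketch of one ETO-style iteration in Python with PyTorch. It is an illustration under stated assumptions, not the authors' implementation: the helpers `agent.rollout`, `agent.log_prob`, and the `expert_trajectories` structure are hypothetical placeholders, and the DPO loss follows the standard formulation with an illustrative beta value.

```python
# Sketch of one ETO exploration-training iteration (assumptions noted in comments).
import torch
import torch.nn.functional as F


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss over a contrastive (winner, loser) trajectory pair.

    policy_logp_w / policy_logp_l: summed log-probabilities of the winning
    (expert/success) and losing (failed exploration) trajectories under the
    current policy; ref_logp_* are the same quantities under a frozen
    reference policy.
    """
    logits = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    return -F.logsigmoid(logits).mean()


def eto_iteration(agent, ref_agent, env, expert_trajectories, optimizer, beta=0.1):
    """One exploration-training loop; agent/env interfaces are hypothetical."""
    # Exploration phase: roll out the current policy and keep failed attempts,
    # pairing each failure with the corresponding expert trajectory.
    pairs = []
    for task, expert_traj in expert_trajectories:
        rollout = agent.rollout(env, task)           # hypothetical helper
        if rollout.reward < expert_traj.reward:      # worse than the expert -> failure
            pairs.append((expert_traj, rollout))     # (winner, loser) contrastive pair

    # Training phase: contrastive policy update with DPO.
    for winner, loser in pairs:
        logp_w = agent.log_prob(winner)              # hypothetical helper
        logp_l = agent.log_prob(loser)
        with torch.no_grad():                        # reference policy stays frozen
            ref_logp_w = ref_agent.log_prob(winner)
            ref_logp_l = ref_agent.log_prob(loser)
        loss = dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Repeating `eto_iteration` several times alternates exploration and training, which is the loop structure the paper describes; the per-pair update shown here could equally be batched.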