Trial and Error: Exploration-Based Trajectory Optimization for LLM Agents


10 Jul 2024 | Yifan Song, Da Yin, Xiang Yue, Jie Huang, Sujian Li, Bill Yuchen Lin
This paper introduces Exploration-based Trajectory Optimization (ETO), a learning method that improves open LLM agents by exploiting their exploration failures. Unlike prior approaches that fine-tune only on successful expert trajectories, ETO lets the agent learn from the trajectories on which it fails, within an iterative optimization framework.

ETO alternates between two phases. In the exploration phase, the agent interacts with the environment, attempts tasks, and collects its failure trajectories; each failure is then paired with the corresponding expert trajectory to form a contrastive trajectory pair. In the training phase, the agent updates its policy on these pairs with a contrastive learning method such as DPO. Repeating exploration and training allows the agent's performance to improve continuously.
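To make the training phase concrete, the snippet below sketches a trajectory-level DPO loss of the kind described above: the summed action log-probabilities of the expert ("winning") and failed ("losing") trajectories under the current policy are compared against a frozen reference (SFT) policy. This is a minimal sketch under those assumptions; the function name, the beta value, and the toy numbers are illustrative, not the authors' code.

```python
# Hedged sketch of a trajectory-level DPO loss, assuming each trajectory is
# scored by the sum of its per-action log-probabilities; names and numbers
# below are illustrative assumptions, not the authors' released code.
import torch
import torch.nn.functional as F

def dpo_trajectory_loss(policy_logp_win, policy_logp_lose,
                        ref_logp_win, ref_logp_lose, beta=0.1):
    """DPO loss over a batch of (expert, failure) trajectory pairs.

    Each tensor holds the summed log-probability of every action in one
    trajectory under either the trainable policy or the frozen SFT
    reference policy; the expert trajectory is the "winner".
    """
    win_margin = policy_logp_win - ref_logp_win      # policy vs. reference on the expert trajectory
    lose_margin = policy_logp_lose - ref_logp_lose   # policy vs. reference on the failed rollout
    # Push the expert trajectory's margin above the failure's margin.
    return -F.logsigmoid(beta * (win_margin - lose_margin)).mean()

# Toy usage with made-up log-probabilities for two trajectory pairs.
loss = dpo_trajectory_loss(
    policy_logp_win=torch.tensor([-12.0, -9.5]),
    policy_logp_lose=torch.tensor([-15.0, -14.0]),
    ref_logp_win=torch.tensor([-13.0, -10.0]),
    ref_logp_lose=torch.tensor([-14.5, -13.0]),
)
print(round(loss.item(), 4))
```

In an iterative ETO-style loop, this loss would be minimized on the pairs collected in each exploration round before the agent explores again with its updated policy.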
Experiments on three complex tasks (WebShop for web navigation, ScienceWorld for simulated science experiments, and ALFWorld for embodied household tasks) show that ETO consistently outperforms baseline methods. On the challenging out-of-distribution test set of ScienceWorld, ETO improves over SFT by 22%, indicating strong generalization. ETO also improves task-solving efficiency, reaching higher rewards with fewer action steps, and it remains effective even in the extreme setting where no expert trajectories are available by running in a self-play mode.

The paper further studies the effect of the number of iterations and of different strategies for constructing contrastive data. Performance rises over the first few iterations but can decline with more, and among the pairing strategies compared, trajectory-wise contrast yields the best results. In the expert-free setting, self-play exploration alone still delivers sizable gains.

The authors conclude that ETO is a promising way to strengthen LLM agents by learning from exploration failures: its iterative exploration-training loop yields gains in performance, efficiency, and generalization across a wide range of tasks.
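To complement the loss above, the following hypothetical sketch shows how contrastive trajectory pairs might be assembled, both in the standard setting (expert trajectory vs. failed rollout on the same task) and in the expert-free self-play setting; the best-vs-worst pairing rule and every name here are assumptions for illustration, not the paper's exact procedure.

```python
# Hypothetical sketch of contrastive-pair construction for ETO-style training.
# Trajectory, build_pairs, and the best-vs-worst rule for the self-play
# setting are assumptions for illustration, not the paper's exact recipe.
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

@dataclass
class Trajectory:
    task_id: str
    actions: List[str]   # the agent's action sequence
    reward: float        # final task reward reported by the environment

def build_pairs(
    explored: List[Trajectory],
    experts: Optional[Dict[str, Trajectory]] = None,
) -> List[Tuple[Trajectory, Trajectory]]:
    """Return (winner, loser) pairs for contrastive (e.g. DPO) training."""
    by_task: Dict[str, List[Trajectory]] = {}
    for traj in explored:
        by_task.setdefault(traj.task_id, []).append(traj)

    pairs: List[Tuple[Trajectory, Trajectory]] = []
    for task_id, trajs in by_task.items():
        trajs.sort(key=lambda t: t.reward)  # worst first, best last
        if experts and task_id in experts:
            # Standard setting: expert trajectory beats the worst failed rollout.
            pairs.append((experts[task_id], trajs[0]))
        elif len(trajs) >= 2 and trajs[-1].reward > trajs[0].reward:
            # Expert-free self-play setting (assumption): best rollout beats worst.
            pairs.append((trajs[-1], trajs[0]))
    return pairs
```

Either way, the resulting (winner, loser) pairs feed directly into a trajectory-level contrastive loss such as the `dpo_trajectory_loss` sketch above.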