Understanding Tree Search for Language Model Agents

This paper introduces an inference-time search algorithm for language model (LM) agents to improve their performance on realistic web tasks. The proposed method enables LM agents to perform exploration and multi-step planning in interactive web environments through a best-first tree search approach. This approach is complementary to existing state-of-the-art agents and is the first tree search algorithm for LM agents that shows effectiveness on realistic web tasks.

On the VisualWebArena benchmark, applying the search algorithm on top of a GPT-4o agent yields a 39.7% relative increase in success rate compared to the same baseline without search, setting a state-of-the-art success rate of 26.4%. On WebArena, search also yields a 28.0% relative improvement over a baseline agent, setting a competitive success rate of 19.2%. The experiments highlight the effectiveness of search for web agents, and performance scales with increased test-time compute. The code and models are publicly released at jykoh.com/search-agents.

The paper discusses the challenges of using LM agents for web tasks, including their inability to leverage test-time computation for exploration and multi-step planning. The proposed search algorithm is grounded within the actual environment space and is guided with environmental feedback. It allows agents to explore a much larger number of potentially promising trajectories at test time, reducing uncertainty through explicit exploration and multi-step planning. The value function is computed by marginalizing over reasoning chains of a multimodal LM conditioned on the agent's observations, producing fine-grained scores to effectively guide search.

The paper also discusses the background of realistic simulated web environments, language-guided autonomous agents, and search and planning algorithms. The method is described in detail, including the agent backbone, value function, and search algorithm.
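
To make the best-first search idea concrete, here is a minimal Python sketch. It is not the authors' implementation: the function names, the `budget`, `max_depth`, and `threshold` parameters, and the toy number-reaching environment are all assumptions chosen for illustration. In a real web agent, `propose` would sample candidate actions from the LM, `step` would execute an action in the browser (which may require resetting and replaying actions to backtrack), and `value` would average scores from several sampled reasoning chains of a multimodal LM, as `averaged_value` hints.

```python
import heapq
from itertools import count
from statistics import mean
from typing import Callable, Hashable, Iterable, List, Sequence, Tuple

State = Hashable
Action = Hashable


def averaged_value(score_samples: Sequence[float]) -> float:
    """Average several sampled scores into one estimate.

    Assumption: each sample is a score in [0, 1] produced by one
    reasoning chain; a plain mean stands in for the marginalization.
    """
    return mean(score_samples)


def best_first_search(
    start: State,
    propose: Callable[[State], Iterable[Action]],
    step: Callable[[State, Action], State],
    value: Callable[[State], float],
    budget: int = 20,        # hypothetical cap on expanded states
    max_depth: int = 5,      # hypothetical cap on plan length
    threshold: float = 1.0,  # stop early once a state scores this high
) -> Tuple[float, List[Action]]:
    """Return the best score found and the action sequence that reached it."""
    tie = count()  # tie-breaker so the heap never compares states directly
    start_score = value(start)
    # heapq is a min-heap, so scores are negated to pop the best state first.
    frontier = [(-start_score, next(tie), start, [])]
    best_score, best_plan = start_score, []
    expansions = 0
    while frontier and expansions < budget:
        neg_score, _, state, plan = heapq.heappop(frontier)
        score = -neg_score
        expansions += 1
        if score > best_score:
            best_score, best_plan = score, plan
        if score >= threshold:
            break  # good enough: stop searching
        if len(plan) >= max_depth:
            continue  # depth limit reached: do not expand further
        for action in propose(state):
            child = step(state, action)
            heapq.heappush(
                frontier, (-value(child), next(tie), child, plan + [action])
            )
    return best_score, best_plan


# Toy environment (purely illustrative): reach TARGET starting from 1.
TARGET = 10


def toy_propose(state: int) -> List[str]:
    return ["inc", "double"]


def toy_step(state: int, action: str) -> int:
    return state + 1 if action == "inc" else state * 2


def toy_value(state: int) -> float:
    # Closer to TARGET means a higher score, clipped to [0, 1].
    return max(0.0, 1.0 - abs(TARGET - state) / TARGET)


if __name__ == "__main__":
    print(best_first_search(1, toy_propose, toy_step, toy_value))
```

The priority queue is what distinguishes this from greedy step-by-step acting: a branch that scored well early but stalls can be abandoned in favor of an earlier, still-unexpanded state. The toy `step` function is free to call from any state, whereas in a live web environment returning to an earlier state carries a real cost.
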

The experiments show that the search procedure is complementary with existing LM agents and enables these models to perform better on harder and longer horizon tasks. The results demonstrate that search significantly improves the success rates of LM agents on realistic web tasks, with the best performance achieved on the VisualWebArena benchmark. The paper also discusses the limitations of the approach, including the computational cost of search and the difficulty of backtracking in real-world environments. The authors conclude that inference-time search is a key component for building capable agents that can plan, reason, and act autonomously to perform computer tasks.
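
As a quick sanity check on the relative improvements quoted in the summary, the implied baseline success rates can be recovered from the reported figures. The baseline values below are derived here from the stated numbers and rounded; the summary itself does not quote them.

```latex
% r = relative improvement, s_search = success rate with search
\[
  s_{\text{base}} = \frac{s_{\text{search}}}{1 + r}
\]
\[
  \text{VisualWebArena: } \frac{26.4\%}{1.397} \approx 18.9\%,
  \qquad
  \text{WebArena: } \frac{19.2\%}{1.280} = 15.0\%
\]
```

In absolute terms this corresponds to gains of roughly 7.5 and 4.2 percentage points, respectively.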

TREE SEARCH FOR LANGUAGE MODEL AGENTS

1 Jul 2024 | Jing Yu Koh, Stephen McAleer, Daniel Fried, Ruslan Salakhutdinov