22 Jul 2024 | Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, Jonathan Berant
The paper introduces ASSISTANTBENCH, a new benchmark designed to evaluate the ability of web agents to perform realistic and time-consuming tasks on the web. The benchmark consists of 214 diverse tasks that cover various scenarios and domains, requiring agents to browse the web, interact with multiple websites, and synthesize information to produce answers. The authors also propose SEEPLANACT (SPA), an advanced web agent equipped with planning and memory components, which significantly outperforms existing agents. The evaluation reveals that current models, including closed-book and retrieval-augmented language models, struggle with ASSISTANTBENCH tasks, with no model achieving an accuracy of more than 25 points. SPA, on the other hand, achieves a score of 25 points, outperforming other agents by about 7 points.
The paper also analyzes the failures of current systems, highlighting that web navigation remains a major challenge, with errors often occurring due to incorrect trajectories or getting stuck in loops. The authors conclude by discussing the limitations of the benchmark and the potential for future research to improve web agents' capabilities.