22 Jul 2024 | Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, Jonathan Berant
AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?
AssistantBench is a new benchmark consisting of 214 realistic tasks that can be automatically evaluated, covering various scenarios and domains. The benchmark highlights the limitations of current systems, including language models and retrieval-augmented language models, as no model reaches an accuracy of more than 25 points. While closed-book LMs perform relatively well in terms of accuracy, they exhibit low precision because they tend to hallucinate facts. State-of-the-art web agents score near zero. The benchmark introduces SEEPLANACT (SPA), a new web agent that significantly outperforms previous agents, and an ensemble of SPA and closed-book models achieves the best performance. Web navigation remains a major challenge.
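Because the tasks have short, closed-form answers (names, numbers, short lists), they can be scored automatically. The sketch below shows one way such scoring could be implemented; the token-level F1 for strings and the relative tolerance for numbers are illustrative assumptions, not the benchmark's published metric.

```python
# A minimal sketch of automatic answer scoring for short string/number answers.
# The exact AssistantBench metric may differ; token-level F1 and a numeric
# tolerance are assumptions here, not the paper's published implementation.
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold string answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def score_answer(prediction: str, gold: str) -> float:
    """Score numeric answers by relative closeness, everything else by token F1."""
    try:
        pred_num, gold_num = float(prediction), float(gold)
        if gold_num == 0:
            return float(pred_num == 0)
        return max(0.0, 1.0 - abs(pred_num - gold_num) / abs(gold_num))
    except ValueError:
        return token_f1(prediction, gold)


print(score_answer("San Francisco", "san francisco"))  # 1.0
print(score_answer("95", "100"))                       # 0.95
```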
The benchmark was created by first asking 18 participants to share recent information-seeking tasks that they could solve using the web but that required a few minutes of browsing. This set was expanded by asking crowdworkers to use tasks from the seed set as templates for new tasks. Expert crowdworkers were also asked to share recent tasks requiring expertise in their field. The final set includes 214 tasks from 53 people, 35 of whom are domain experts; the tasks collectively require browsing 525 web pages across 258 different websites.
The benchmark evaluates the ability of web agents to browse the entire web and solve real-world tasks that are time-consuming for humans. Tasks are based on real information needs encountered by humans. To solve these tasks, an agent must autonomously browse the web to identify relevant web pages and interact with them to produce an output.
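Concretely, this means an agent runs a closed loop: observe the current page, decide on the next action, and either navigate further or return an answer. The following is a minimal sketch of that loop; the `Browser` class and `propose_action` stub are hypothetical placeholders, and real agents such as SeeAct drive an actual browser with an LM in the decision step.

```python
# A minimal sketch of the observe-act loop a web agent might run.
# `Browser` and `propose_action` are hypothetical stand-ins, not the paper's code.
from dataclasses import dataclass, field


@dataclass
class Browser:
    """Toy browser state: current URL plus a navigation history."""
    url: str = "about:blank"
    history: list = field(default_factory=list)

    def observe(self) -> str:
        return f"Currently at {self.url}"

    def goto(self, url: str) -> None:
        self.history.append(self.url)
        self.url = url


def propose_action(task: str, observation: str) -> dict:
    """Placeholder for an LM call mapping (task, observation) to the next action."""
    if "about:blank" in observation:
        return {"type": "goto", "url": "https://www.google.com/search?q=" + task}
    return {"type": "answer", "text": "<answer extracted from the page>"}


def run_agent(task: str, max_steps: int = 10) -> str:
    browser = Browser()
    for _ in range(max_steps):
        action = propose_action(task, browser.observe())
        if action["type"] == "answer":   # the agent decides it has the answer
            return action["text"]
        browser.goto(action["url"])      # otherwise keep browsing
    return "no answer"                   # step budget exhausted


print(run_agent("cheapest gym within 10 minutes of my home"))
```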
The benchmark includes tasks that require planning and reasoning, and the results show that current models struggle with these tasks. The best model, SPA, outperforms SEEACT by about 7 points, answering twice as many questions with higher precision. An ensemble of SPA and closed-book models achieves the best overall performance.
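One simple way to combine a web agent with a closed-book model, sketched below as an assumption rather than the paper's exact combination strategy, is to prefer the agent's answer whenever it manages to produce one and fall back to the closed-book model otherwise; `web_agent_answer` and `closed_book_answer` are hypothetical stand-ins.

```python
# A hedged sketch of one possible agent/closed-book ensemble: trust the web
# agent when it returns a grounded answer, otherwise fall back to the
# closed-book LM. Illustrative only; the callables below are hypothetical.
from typing import Callable, Optional


def ensemble_answer(
    task: str,
    web_agent_answer: Callable[[str], Optional[str]],
    closed_book_answer: Callable[[str], str],
) -> str:
    agent_answer = web_agent_answer(task)
    if agent_answer:                    # the agent found an answer on the web
        return agent_answer
    return closed_book_answer(task)     # otherwise rely on parametric knowledge


# Toy usage with stub models.
print(
    ensemble_answer(
        "Which year was the museum nearest to X founded?",
        web_agent_answer=lambda task: None,       # agent failed to navigate
        closed_book_answer=lambda task: "1972",   # closed-book guess
    )
)
```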
The benchmark highlights that web navigation remains a major challenge for current systems. Tasks provided by experts are the most challenging, and errors during web navigation, such as choosing an incorrect trajectory or getting stuck in a loop, are frequent. Closed-book models often generate hallucinated facts, and retrieval-augmented models often fail to retrieve relevant information.
The benchmark also shows that proprietary chatbots like ChatGPT suffer from similar problems. The main contributions of the paper include the release of AssistantBench, the introduction of SPA, and the demonstration that AssistantBench is challenging for current systems. The benchmark provides a useful tool for evaluating future web agents.