Large Language Models Can Self-Improve At Web Agent Tasks

30 May 2024 | Ajay Patel, Markus Hofmarcher, Claudiu Leoveanu-Condrei, Marius-Constantin Dinu, Chris Callison-Burch, Sepp Hochreiter
This paper explores the ability of large language models (LLMs) to self-improve their performance on complex, long-horizon web agent tasks using the WebArena benchmark. WebArena tasks require agents to navigate web pages and perform actions to achieve specific objectives, such as adding a product to a wishlist or finding travel times between locations. The study investigates fine-tuning on three synthetic training data mixtures: Mixture A (in-domain synthetic examples), Mixture B (in-domain and out-of-domain synthetic examples), and Mixture C (out-of-domain synthetic examples). Fine-tuning on these mixtures improves the agent's performance, with Mixture B achieving a 31% improvement over the base model. The paper also introduces new evaluation metrics, including a capability score and an extension of the VERTEX score, to assess the agent's performance, robustness, and trajectory quality. The findings suggest that self-improvement techniques can enhance LLMs' capabilities in complex environments without relying on additional supervised training data.
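To make the three mixtures concrete, the sketch below shows one way such datasets could be assembled from self-generated examples. The function, field names, and shuffling step are illustrative assumptions, not the paper's actual pipeline:

```python
import random

def build_mixture(in_domain, out_domain, mixture, seed=0):
    """Assemble a fine-tuning dataset from self-generated examples.

    mixture: "A" -> in-domain only, "B" -> both, "C" -> out-of-domain only.
    Hypothetical helper; the paper's real data pipeline may differ.
    """
    if mixture == "A":
        data = list(in_domain)
    elif mixture == "B":
        data = list(in_domain) + list(out_domain)
    elif mixture == "C":
        data = list(out_domain)
    else:
        raise ValueError(f"unknown mixture: {mixture}")
    # Shuffle deterministically so training order is reproducible.
    random.Random(seed).shuffle(data)
    return data

# Toy examples mirroring the task types mentioned above.
in_dom = [{"task": "add product to wishlist", "trajectory": "..."}]
out_dom = [{"task": "find travel time between locations", "trajectory": "..."}]
print(len(build_mixture(in_dom, out_dom, "B")))  # 2
```

Mixture B simply concatenates both pools before shuffling, which matches the paper's finding that combining in-domain and out-of-domain synthetic examples yields the largest gain.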