Large Language Models Can Self-Improve At Web Agent Tasks

30 May 2024 | Ajay Patel, Markus Hofmarcher, Claudiu Leoveanu-Condrei, Marius-Constantin Dinu, Chris Callison-Burch, Sepp Hochreiter
Large language models (LLMs) can self-improve their performance on complex, long-horizon web agent tasks. This study examines the extent to which LLMs can improve by fine-tuning on synthetic data generated by the model itself. It uses the WebArena benchmark, in which an agent must autonomously navigate web pages and perform actions to achieve a specified objective.

The authors construct three synthetic training data mixtures: in-domain (Mixture A), in-domain plus out-of-domain (Mixture B), and out-of-domain only (Mixture C). Fine-tuning on these mixtures yields up to a 31% improvement in task completion rate over the base model on WebArena.

The study also introduces novel evaluation metrics to assess the performance, robustness, capabilities, and trajectory quality of the fine-tuned agent models. These metrics provide a more nuanced view of improvements and degradations than aggregate-level benchmark scores. The results show that self-improvement techniques can enhance LLM performance on complex, multi-step tasks, and that self-improved agents can acquire new capabilities, although they may also lose some. The quality of generated trajectories is not significantly degraded when fine-tuning on Mixtures A and B, but does degrade when fine-tuning on Mixture C.

The authors conclude that self-improvement can boost LLM performance in complex, multi-step agent environments without relying on supervised training data, and that the proposed procedures are a promising step in that direction. They also note limitations of the current methods, including the potential to reinforce incorrect actions and biases, and the need for further research into more reliable evaluation metrics.
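One of the study's metrics tracks capabilities acquired and lost after fine-tuning. As a rough illustration only (this is an assumption about the metric's general shape, not the paper's exact definition), per-task success flags before and after fine-tuning can be compared:

```python
def capability_shift(base_results, tuned_results):
    """Compare per-task success (dicts of task -> bool) for a base model
    and a fine-tuned model.

    Returns (acquired, lost): tasks the tuned model newly solves, and
    tasks the base model solved that the tuned model no longer does.
    """
    acquired = {t for t, ok in tuned_results.items()
                if ok and not base_results.get(t, False)}
    lost = {t for t, ok in base_results.items()
            if ok and not tuned_results.get(t, False)}
    return acquired, lost

base = {"search_product": True, "post_comment": False, "edit_profile": True}
tuned = {"search_product": True, "post_comment": True, "edit_profile": False}
acquired, lost = capability_shift(base, tuned)
# acquired == {"post_comment"}; lost == {"edit_profile"}
```

A task-level view like this makes the trade-off visible: an aggregate score can rise even while some previously working capabilities regress.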
The study also discusses the broader impacts of the research, including the potential for future work on larger and more diverse benchmarks.
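The self-improvement recipe described above can be sketched as a simple loop: roll out the current model on tasks, keep only the trajectories the model itself judges successful (no ground-truth labels), and use the survivors as fine-tuning data. The sketch below is a toy illustration under that assumption; `ToyAgent`, `rollout`, and `self_judge` are hypothetical names, not from the paper's code.

```python
class ToyAgent:
    """Stand-in for an LLM web agent that can solve a fixed set of tasks."""

    def __init__(self, known_tasks):
        self.known_tasks = set(known_tasks)

    def rollout(self, task):
        # Produce a trajectory: a list of (observation, action) steps.
        steps = [("page_0", "click"), ("page_1", "type"), ("page_2", "submit")]
        return {"task": task, "steps": steps}

    def self_judge(self, trajectory):
        # Unsupervised success check by the model itself; no labels used.
        return trajectory["task"] in self.known_tasks


def build_finetuning_mixture(agent, tasks):
    """Collect rollouts and keep only self-judged successes: the synthetic
    in-domain data a self-improvement round would fine-tune on."""
    rollouts = [agent.rollout(t) for t in tasks]
    return [r for r in rollouts if agent.self_judge(r)]


agent = ToyAgent(known_tasks={"search_product", "checkout"})
mixture = build_finetuning_mixture(
    agent, ["search_product", "checkout", "book_flight"]
)
# Only the two self-judged successes survive the filter.
```

Because the filter is the model's own judgment, any systematic judging error is fed back into training, which is exactly the risk of reinforcing incorrect actions and biases that the study flags.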