WebCanvas: Benchmarking Web Agents in Online Environments


16 Jul 2024 | Yichen Pan, Dehan Kong, Sida Zhou, Cheng Cui, Yifei Leng, Bing Jiang, Hangyu Liu, Yanyi Shang, Shuyan Zhou, Tongshuang Wu, Zhengyang Wu
WebCanvas is an online evaluation framework designed to address the dynamic nature of web interactions, bridging the gap between static benchmarks and real-world web environments. The framework consists of three main components: a novel evaluation metric that captures critical intermediate actions or states, a benchmark dataset called Mind2Web-Live, and lightweight annotation tools and testing pipelines. WebCanvas introduces the concept of "key nodes" to enable detailed, continuous analysis of agent behavior. The Mind2Web-Live dataset contains 542 tasks with 2,439 intermediate evaluation states, and the framework employs efficient data-maintenance strategies to keep evaluations valid as live websites change. The best-performing agent achieved a task success rate of 23.1% and a task completion rate of 48.8% on the Mind2Web-Live test set. The study also analyzes performance discrepancies across websites, domains, and experimental environments, highlighting the importance of a consistent setup for reliable results. Additionally, the framework includes a universal agent framework with extensible reasoning modules, providing a foundation for further research on web agent evaluation.
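To make the key-node metric concrete, here is a minimal sketch of how such scoring could work. Everything below is an illustrative assumption rather than WebCanvas's actual API: `KeyNode`, its `matches` predicate, and `evaluate_trajectory` are hypothetical names, and in-order matching is one plausible reading of "critical intermediate actions or states."

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

@dataclass
class KeyNode:
    """A critical intermediate action or state a trajectory must reach.

    `matches` is a hypothetical predicate standing in for WebCanvas's
    actual matching rules (e.g. URL patterns or element/text checks).
    """
    description: str
    matches: Callable[[dict], bool]

def evaluate_trajectory(steps: Sequence[dict],
                        key_nodes: Sequence[KeyNode]) -> Tuple[bool, float]:
    """Score one agent trajectory against an ordered list of key nodes.

    A task counts as a success only if every key node is matched in
    order; the completion rate is the fraction of key nodes reached,
    giving partial credit for progress toward the goal.
    """
    matched = 0
    remaining = iter(steps)  # a shared iterator enforces in-order matching
    for node in key_nodes:
        if any(node.matches(step) for step in remaining):
            matched += 1
        else:
            break  # later nodes cannot be matched once one is missed
    success = matched == len(key_nodes)
    completion = matched / len(key_nodes) if key_nodes else 1.0
    return success, completion

def aggregate(results: List[Tuple[bool, float]]) -> Tuple[float, float]:
    """Average per-task results into the two headline metrics."""
    successes = sum(1 for ok, _ in results if ok)
    task_success_rate = successes / len(results)
    task_completion_rate = sum(c for _, c in results) / len(results)
    return task_success_rate, task_completion_rate
```

Under this reading, averaging per-task success flags yields the task success rate (23.1% for the best agent), while averaging per-task completion fractions yields the task completion rate (48.8%), which explains why the latter is substantially higher: agents often reach some key nodes without finishing the task.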