WebCanvas: Benchmarking Web Agents in Online Environments


16 Jul 2024 | Yichen Pan, Dehan Kong, Sida Zhou, Cheng Cui, Yifei Leng, Bing Jiang, Hangyu Liu, Yanyi Shang, Shuyan Zhou, Tongshuang Wu, Zhengyang Wu
WebCanvas is an online evaluation framework for web agents that addresses the dynamic nature of web interactions. It introduces three main components: (1) a novel evaluation metric that captures critical intermediate actions or states while disregarding noise; (2) a benchmark dataset called Mind2Web-Live, containing 542 tasks with 2439 intermediate evaluation states; and (3) lightweight, generalizable annotation tools and testing pipelines for maintaining high-quality, up-to-date data. WebCanvas also provides an open-source agent framework with extensible reasoning modules, enabling the community to run online inference and evaluation. The best-performing agent achieved a task success rate of 23.1% and a task completion rate of 48.8% on the Mind2Web-Live test set.

The framework accounts for the non-uniqueness of action paths in online web interactions: "Trophies" represent step scores earned when the agent reaches a key node. A cost-effective maintenance strategy, with scheduled monitoring and automated alerts for data corrections, sustains the validity of the evaluation over time.

The Mind2Web-Live dataset was constructed by sampling tasks from the Mind2Web dataset and re-annotating them in a real-world online environment. It comprises 542 tasks, 2439 key nodes, and 4550 detailed annotation steps. Evaluation relies on two metrics: a step score, which measures agent performance with respect to each key node, and a task score, which assesses task completeness and execution efficiency.

The study found that GPT-4 outperformed other models in both effectiveness and efficiency on web-agent tasks in live environments, while models trained on static datasets did not generalize well to online settings. An analysis of factors influencing agent performance (task complexity, website dynamics, and experimental setup) showed that increased task complexity directly correlates with diminished performance, and that agents handled entertainment-related tasks better than shopping or travel tasks. Integrating a reward module with human-labeled rewards further improved agent performance in online web environments.

Overall, the framework provides a comprehensive evaluation of web agents in real-world environments through key nodes and corresponding evaluation functions. The authors encourage further research on online datasets, web agents, and evaluation functions to advance the field of autonomous intelligence.
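To make the key-node idea concrete, here is a minimal sketch of how step and task scores could be computed from an agent trajectory. The class and function names, the match predicates, and the example states are illustrative assumptions, not the actual WebCanvas API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A key node pairs a description with an evaluation function that checks
# whether a browser-state snapshot satisfies it. (Hypothetical structure.)
@dataclass
class KeyNode:
    description: str
    is_reached: Callable[[Dict], bool]

def evaluate_trajectory(states: List[Dict], key_nodes: List[KeyNode]) -> Dict:
    """Score one trajectory against an ordered list of key nodes.

    A "Trophy" (step score) is earned the first time an observed state
    satisfies the next key node; the completion rate and success flag
    summarize how much of the task was accomplished.
    """
    earned = 0
    for state in states:
        if earned < len(key_nodes) and key_nodes[earned].is_reached(state):
            earned += 1
    return {
        "step_score": earned,                         # key nodes reached
        "completion_rate": earned / len(key_nodes),   # fraction of key nodes
        "task_success": earned == len(key_nodes),     # all key nodes reached
    }

# Example with two hypothetical key nodes matched on URL and element value.
key_nodes = [
    KeyNode("reached search results", lambda s: "q=laptop" in s.get("url", "")),
    KeyNode("applied price filter", lambda s: s.get("max_price") == "500"),
]
states = [
    {"url": "https://example.com/search?q=laptop"},
    {"url": "https://example.com/search?q=laptop", "max_price": "500"},
]
print(evaluate_trajectory(states, key_nodes))
```

Because scoring is keyed to intermediate states rather than a single reference action sequence, any path that passes through the key nodes receives credit, which is how the framework tolerates the non-uniqueness of valid solutions.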