WebCanvas is an online evaluation framework designed to address the dynamic nature of web interactions, bridging the gap between static benchmarks and real-world web environments. The framework consists of three main components: a novel evaluation metric that captures critical intermediate actions or states, a benchmark dataset called Mind2Web-Live, and lightweight annotation tools and testing pipelines. WebCanvas introduces the concept of "key nodes", the critical intermediate actions or states a task must pass through, to enable detailed and continuous analysis of agent behavior. The Mind2Web-Live dataset contains 542 tasks with 2,439 intermediate evaluation states, and the framework employs efficient data-maintenance strategies to keep evaluations valid as live websites change. The best-performing agent achieved a task success rate of 23.1% and a task completion rate of 48.8% on the Mind2Web-Live test set. The study also analyzes performance discrepancies across websites, domains, and experimental environments, highlighting the importance of a consistent setup for reliable results. Additionally, WebCanvas provides a universal agent framework with extensible reasoning modules, offering a foundation for further research and development in web agent evaluation.
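The two metrics above, task success rate and task completion rate, both derive from key nodes. As a minimal sketch of how such key-node scoring might work (not WebCanvas's actual API: the `KeyNode` class, `evaluate_trajectory`, and `aggregate` helpers below are hypothetical, and the in-order matching rule is an assumption):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class KeyNode:
    """A critical intermediate action or state a task must pass through.

    `predicate` returns True if a trajectory step satisfies this node
    (e.g., reaching a URL or interacting with a specific element).
    """
    description: str
    predicate: Callable[[dict], bool]

def evaluate_trajectory(trajectory: list[dict], key_nodes: list[KeyNode]):
    """Score one agent trajectory against a task's ordered key nodes.

    Returns (completion_rate, success): completion_rate is the fraction
    of key nodes matched in order; success requires matching all of them.
    """
    matched = 0
    steps = iter(trajectory)  # shared iterator enforces in-order matching
    for node in key_nodes:
        # Advance through remaining steps until this node is satisfied.
        if any(node.predicate(step) for step in steps):
            matched += 1
        else:
            break  # steps exhausted; later nodes cannot be matched
    completion_rate = matched / len(key_nodes) if key_nodes else 1.0
    return completion_rate, matched == len(key_nodes)

def aggregate(results: list[tuple[float, bool]]):
    """Benchmark-level metrics: task completion rate averages the
    per-task fraction of matched key nodes; task success rate is the
    share of tasks where every key node was hit."""
    completion = sum(rate for rate, _ in results) / len(results)
    success = sum(ok for _, ok in results) / len(results)
    return completion, success
```

Under this reading, a single aggregate score can sit well above the success rate (as with 48.8% completion versus 23.1% success), since partial progress through a task's key nodes still contributes to the completion rate even when the task ultimately fails.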