2024 | Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, Yu Su
This paper introduces SEEACT, a generalist web agent built on large multimodal models (LMMs) such as GPT-4V, which it uses for integrated visual understanding and acting on websites. SEEACT is evaluated on the MIND2WEB benchmark, which includes over 2,000 complex web tasks. When its textual plans are manually grounded into actions on the page, the agent completes 51.1% of tasks on live websites, outperforming text-only LLMs and smaller models. Grounding remains the major challenge, however: the best automatic grounding strategy, which combines HTML structure and visuals, still trails oracle grounding by 20-30%. The study also highlights the importance of online evaluation, which reflects real-world performance better than offline evaluation. Overall, the results show that LMMs like GPT-4V can perform complex tasks on websites, but their effectiveness depends on accurate grounding, and further work is needed on robust visual grounding and error correction.
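To make the notion of grounding concrete, the sketch below maps a textual action description onto one of several candidate page elements. It is a toy illustration only and not SEEACT's grounding method: the `Candidate` type, the `ground` function, the sample elements, and the token-overlap scoring are all hypothetical stand-ins for the model's judgment over HTML and screenshots.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Candidate:
    """A hypothetical interactive element extracted from a page's HTML."""
    element_id: str
    tag: str
    text: str


def tokens(s: str) -> set[str]:
    # Lowercased word set; a deliberately crude stand-in for model judgment.
    return set(s.lower().split())


def ground(action_description: str, candidates: list[Candidate]) -> Candidate | None:
    """Pick the candidate whose tag and text overlap most with the description."""
    wanted = tokens(action_description)
    best, best_score = None, 0
    for c in candidates:
        score = len(wanted & tokens(f"{c.tag} {c.text}"))
        if score > best_score:
            best, best_score = c, score
    return best  # None means grounding failed: no element matched at all


# Assumed example page with three interactive elements.
candidates = [
    Candidate("e1", "input", "Search flights"),
    Candidate("e2", "button", "Sign in"),
    Candidate("e3", "button", "Search"),
]
chosen = ground("click the search button", candidates)
print(chosen.element_id if chosen else "no match")
```

A plan can be correct in words and still fail at this step, for instance when the description matches no element or matches the wrong one, which is the kind of gap to oracle grounding the summary refers to.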