GPT-4V(ision) is a Generalist Web Agent, if Grounded

2024-03-12 | Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, Yu Su
The paper explores the potential of large multimodal models (LMMs) like GPT-4V as generalist web agents, capable of following natural language instructions to complete tasks on any given website. The authors propose SEEACT, a web agent that leverages LMMs for integrated visual understanding and acting on the web. They evaluate SEEACT on the Mind2Web benchmark in two settings: offline evaluation on cached websites, and online evaluation on live websites using a custom tool. The results show that GPT-4V can successfully complete 51.1% of tasks on live websites if its textual plans are manually grounded into actions, outperforming text-only LLMs like GPT-4 and smaller models. However, grounding remains a significant challenge, with the best strategy still falling short of oracle grounding by 20-30%. The paper also highlights the importance of in-context learning for generalization to unseen websites, and it notes a discrepancy between online and offline evaluation results that reflects the dynamic nature of web interactions.