SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents

23 Feb 2024 | Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, Zhiyong Wu
SeeClick is a visual GUI agent that automates tasks on digital devices using only screenshots as input. Unlike existing GUI agents that rely on structured data (e.g., HTML), SeeClick uses GUI grounding to locate on-screen elements directly from natural-language instructions. This sidesteps the limitations of structured data, which is often inaccessible or inefficient to process, by building on a large vision-language model (LVLM) that is further pre-trained for GUI grounding, supported by an automated pipeline for curating grounding data.

To evaluate grounding in realistic settings, the authors introduce ScreenSpot, the first realistic GUI grounding benchmark, spanning mobile, desktop, and web environments with over 600 screenshots and 1,200 instructions. SeeClick shows significant improvements over baselines on ScreenSpot, and comprehensive evaluations indicate that gains in GUI grounding correlate directly with better performance on downstream GUI agent tasks.

Adapted to mobile and web agent tasks, including MiniWob, AITW, and Mind2Web, SeeClick achieves strong results with minimal training data and outperforms existing vision-based agents, highlighting the importance of GUI grounding for visual GUI agents. The paper also discusses limitations, ethical considerations, and related work in the field of visual GUI agents.
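The grounding task itself is simple to state: given a screenshot and an instruction, the model predicts a click point, and on ScreenSpot a prediction counts as correct when that point falls inside the target element's bounding box. The sketch below illustrates this evaluation loop; the data-class fields and the `predict_click` callable are illustrative stand-ins for the grounding model, not the authors' code.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class GroundingSample:
    """One ScreenSpot-style grounding example (field names are illustrative)."""
    screenshot_path: str                       # path to the interface screenshot
    instruction: str                           # e.g. "open the settings menu"
    bbox: Tuple[float, float, float, float]    # target box (left, top, right, bottom), normalized to [0, 1]

def click_accuracy(
    samples: List[GroundingSample],
    predict_click: Callable[[str, str], Tuple[float, float]],
) -> float:
    """Fraction of samples whose predicted click lands inside the target box.

    `predict_click(screenshot_path, instruction)` stands in for the grounding
    model and is assumed to return normalized (x, y) coordinates in [0, 1].
    """
    hits = 0
    for s in samples:
        x, y = predict_click(s.screenshot_path, s.instruction)
        left, top, right, bottom = s.bbox
        if left <= x <= right and top <= y <= bottom:
            hits += 1
    return hits / len(samples) if samples else 0.0

# Toy usage with a dummy predictor that always clicks the screen center.
if __name__ == "__main__":
    data = [GroundingSample("home.png", "open settings", (0.40, 0.45, 0.60, 0.55))]
    print(click_accuracy(data, lambda img, ins: (0.5, 0.5)))  # -> 1.0
```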