The paper introduces *SeeClick*, a novel visual GUI agent designed to automate complex tasks on digital devices using only screenshots as input. The primary challenge in developing such agents is GUI grounding, the ability to accurately locate screen elements based on instructions. To address this, *SeeClick* is enhanced with GUI grounding pre-training and a method to automate the curation of GUI grounding data. The authors also create *ScreenSpot*, a realistic GUI grounding benchmark that includes over 600 screenshots and 1200 instructions from various GUI platforms. Evaluations on *ScreenSpot* and three widely used benchmarks show that *SeeClick* outperforms existing models, demonstrating the effectiveness of GUI grounding pre-training. The paper concludes by discussing the limitations and ethical considerations of GUI agents, emphasizing the importance of privacy, safety, and bias mitigation.