GUICourse: From General Vision Language Model to Versatile GUI Agent

17 Jun 2024 | Wentong Chen, Junbo Cui, Jinyi Hu, Yujia Qin, Junjie Fang, Yue Zhao, Chongyi Wang, Jun Liu, Guirong Chen, Yupeng Huo, Yuan Yao, Yankai Lin, Zhiyuan Liu, Maosong Sun
GUICourse is a comprehensive dataset suite designed to enhance the capabilities of Vision Language Models (VLMs) on GUI navigation tasks. The suite has three main components: GUIEnv, GUIAct, and GUIChat. GUIEnv improves OCR and grounding abilities through large-scale annotated screenshots and QA pairs. GUIAct provides navigation tasks in website and smartphone scenarios to build GUI knowledge. GUIChat supplies conversational data to improve the interaction skills of GUI agents. These datasets are used to train GUI agents based on several VLMs, including Qwen-VL, Fuyu-8B, and MiniCPM-V. Experiments show that the resulting agents outperform their baseline VLMs on common GUI benchmarks such as Mind2Web and AITW.

Ablation studies reveal that GUIEnv data significantly improves OCR and grounding abilities, while higher-resolution inputs and mixing in GUIChat data further improve performance. The trained agents handle multi-step tasks well and generalize across different GUI systems. The datasets are released for research and development, enabling the creation of versatile GUI agents that help humans navigate digital tools effectively.