The paper introduces GUICourse, a comprehensive suite of datasets designed to enhance the capabilities of Vision Language Models (VLMs) for training visual-based GUI agents. The datasets aim to improve VLMs' fundamental abilities, such as OCR and grounding, as well as their knowledge of GUI components and interactions. The datasets include:
1. **GUIEnv**: A large-scale dataset for improving VLMs' OCR and grounding capabilities, comprising 10M pre-training samples and 0.7M SFT samples.
2. **GUIAct**: A multi-scenario dataset for enhancing VLMs' knowledge of GUI systems, including single-step and multi-step action instructions in website and smartphone scenarios.
3. **GUIChat**: A text-rich multi-modal dataset for improving the interaction skills of GUI agents, featuring 44k single-turn QA pairs and 6k multi-turn dialogues with text-rich images and bounding boxes.
The paper demonstrates that the proposed datasets significantly improve the performance of GUI agents on common GUI tasks compared to baseline VLMs. Experiments show that even a small GUI agent (3.1B parameters) performs well on both single-step and multi-step GUI tasks. Ablation studies further analyze the impact of different dataset components on training. The source code and datasets are available at https://github.com/yiye3/GUIcourse.