The paper introduces GUICourse, a comprehensive suite of datasets designed to enhance the capabilities of Vision Language Models (VLMs) for training visual-based GUI agents. The datasets aim to improve VLMs' fundamental abilities, such as OCR and grounding, as well as their knowledge of GUI components and interactions. The datasets include:
1. **GUIEnv**: A large-scale dataset for improving VLMs' OCR and grounding capabilities, comprising 10M pre-training samples and 0.7M SFT samples.
2. **GUIAct**: A multi-scenario dataset for enhancing VLMs' knowledge of GUI systems, including single-step and multi-step action instructions in website and smartphone scenarios.
3. **GUIChat**: A text-rich multi-modal dataset for improving the interaction skills of GUI agents, featuring 44k single-turn QA pairs and 6k multi-turn dialogues with text-rich images and bounding boxes.
The paper demonstrates that the proposed datasets significantly improve the performance of GUI agents on common GUI tasks compared to baseline VLMs. Experiments show that even a small GUI agent (3.1B parameters) performs well on both single-step and multi-step GUI tasks. Ablation studies further analyze the impact of different dataset components on training. The source code and datasets are available at https://github.com/yiye3/GUIcourse.