GUI Odyssey is a comprehensive dataset designed for training and evaluating cross-app GUI navigation agents. It consists of 7,735 episodes from six mobile devices, covering six types of cross-app tasks, 201 apps, and 1,400 app combinations. The dataset was collected through a rigorous process involving diverse cross-app navigation tasks, human demonstrations on an Android emulator, and quality checks. The dataset includes detailed metadata, such as screenshots, actions, and user instructions, and is annotated to ensure accuracy and completeness.
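The exact schema of an episode record is defined in the GUI Odyssey repository; as a rough illustration of the metadata described above, an episode pairing a user instruction with a sequence of screenshot/action steps might be modeled like this (all field and class names here are hypothetical, not the dataset's actual keys):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Step:
    """One demonstration step: the screen the annotator saw and the action taken."""
    screenshot: str  # path to the captured screen image (hypothetical field)
    action: str      # e.g. "CLICK" or "TYPE" -- illustrative action names only

@dataclass
class Episode:
    """One cross-app navigation demonstration."""
    instruction: str                # the natural-language task given to the annotator
    device: str                     # which of the six emulated devices was used
    apps: List[str]                 # the apps traversed in this episode
    steps: List[Step] = field(default_factory=list)

# Illustrative usage
ep = Episode(
    instruction="Find a recipe online and share it in a messaging app",
    device="Pixel",
    apps=["Browser", "Messenger"],
)
ep.steps.append(Step(screenshot="step_000.png", action="CLICK"))
```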
To evaluate cross-app navigation, the researchers developed OdysseyAgent, a multimodal agent based on the Qwen-VL model with a history resampling module. This module allows the model to efficiently attend to historical screenshot image tokens, improving the agent's ability to navigate across multiple apps. Extensive experiments show that OdysseyAgent outperforms existing models in both in-domain and out-of-domain settings, achieving higher accuracy than fine-tuned Qwen-VL and zero-shot GPT-4V.
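The paper's resampler internals are not detailed here, but the core idea of compressing a long stream of historical screenshot tokens into a fixed-size set can be sketched as cross-attention from a small set of learnable queries. The sketch below is a minimal NumPy illustration (no learned projections, hypothetical dimensions), not the actual OdysseyAgent implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def resample_history(history_tokens, queries):
    """Compress T historical screenshot tokens into Q tokens (Q << T)
    via cross-attention from learnable query vectors.

    history_tokens: (T, d) array -- image tokens concatenated across past screens
    queries:        (Q, d) array -- learnable queries (random here for illustration)
    returns:        (Q, d) array -- fixed-size summary the agent attends to
    """
    d = history_tokens.shape[-1]
    scores = queries @ history_tokens.T / np.sqrt(d)  # (Q, T) similarity scores
    attn = softmax(scores, axis=-1)                   # each query's weights over history
    return attn @ history_tokens                      # weighted sum -> (Q, d)

# Example: 512 history tokens compressed to 32, keeping cost fixed per step.
rng = np.random.default_rng(0)
history = rng.standard_normal((512, 64))
queries = rng.standard_normal((32, 64))
compressed = resample_history(history, queries)
```

The benefit is that the agent's context cost stays constant regardless of episode length: each new screenshot adds tokens to `history`, but the model only ever attends to the Q compressed tokens.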
The dataset and code are available on GitHub at https://github.com/OpenGVLab/GUI-Odyssey. The study highlights the importance of cross-app navigation in real-world scenarios and the challenges in developing agents that can handle complex, multi-app tasks. The results demonstrate the effectiveness of the proposed approach and the potential of GUI Odyssey in advancing the field of autonomous GUI navigation.