GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents

16 Jun 2024 | Dongping Chen1*†, Yue Huang2*, Siyuan Wu1*, Jingyu Tang1*, Liuyi Chen1, Yilin Bai1, Zhigang He1, Chenlong Wang1, Huichi Zhou1, Yiqiang Li1, Tianshuo Zhou1, Yue Yu1, Chujie Gao1, Qihui Zhang1, Yi Gui1, Zhen Li1, Yao Wan1†, Pan Zhou1, Jianfeng Gao3, Lichao Sun4
GUI-WORLD is a new dataset designed to evaluate and enhance the capabilities of Multimodal Large Language Models (MLLMs) in understanding Graphical User Interface (GUI) content, particularly in dynamic and sequential tasks. The dataset comprises over 12,000 GUI videos spanning six scenarios, including desktop and mobile applications, multi-window interactions, and extended reality (XR) environments. It features meticulously crafted Human-MLLM annotations, including detailed captions, keyframes, and diverse types of question-answer pairs. The dataset is used to assess current state-of-the-art MLLMs, both ImageLLMs and VideoLLMs, on a range of GUI content, with an emphasis on dynamic and sequential content. The findings reveal that ImageLLMs struggle with dynamic GUI content unless they are given manually annotated keyframes or operation history, while VideoLLMs fall short on all GUI-oriented tasks owing to the sparsity of GUI-specific video training data. Building on GUI-WORLD, the paper introduces GUI-Vid, a GUI-oriented VideoLLM with enhanced capabilities for handling diverse and complex GUI tasks. However, because of the performance limitations of the underlying base LLMs, using VideoLLMs as GUI agents remains a significant challenge. The paper also examines factors critical to GUI understanding, including the integration of textual information, the number of keyframes, and image resolution. Overall, the key contributions of this paper are a new dataset, a novel model, and comprehensive experiments with valuable insights. The dataset and code are publicly available at the project homepage.
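To make the annotation structure and the keyframe-count factor more concrete, the sketch below shows how a GUI-WORLD-style record (caption, annotated keyframes, question-answer pairs) might be represented and how keyframes could be subsampled to fit a VideoLLM's frame budget. This is a minimal illustrative sketch: the field names (`video_path`, `scenario`, `caption`, `keyframes`, `qa_pairs`) and the helper `sample_keyframes` are assumptions for exposition, not the dataset's actual schema or the paper's pipeline.

```python
# Hypothetical sketch of a GUI-WORLD-style annotation record and keyframe budgeting.
# Field names and the sampling helper are illustrative assumptions, not the real schema.
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class GUIVideoRecord:
    video_path: str                                                  # path to the GUI screen recording
    scenario: str                                                    # e.g. "desktop", "mobile", "multi-window", "XR"
    caption: str                                                     # detailed description of the clip
    keyframes: List[int] = field(default_factory=list)               # annotated keyframe indices
    qa_pairs: List[Tuple[str, str]] = field(default_factory=list)    # (question, answer) pairs


def sample_keyframes(record: GUIVideoRecord, max_frames: int = 8) -> List[int]:
    """Uniformly subsample annotated keyframes so they fit a model's frame budget."""
    if len(record.keyframes) <= max_frames:
        return record.keyframes
    step = len(record.keyframes) / max_frames
    return [record.keyframes[int(i * step)] for i in range(max_frames)]


if __name__ == "__main__":
    record = GUIVideoRecord(
        video_path="videos/desktop_0001.mp4",
        scenario="desktop",
        caption="The user opens a settings window and toggles dark mode.",
        keyframes=list(range(0, 300, 10)),
        qa_pairs=[("What setting did the user change?", "Dark mode was enabled.")],
    )
    print(sample_keyframes(record, max_frames=8))
```

A record like this makes the paper's ablation axes (number of keyframes, image resolution, textual context) easy to vary independently when probing an ImageLLM or VideoLLM.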