GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents

16 Jun 2024 | Dongping Chen, Yue Huang, Siyuan Wu, Jingyu Tang, Liuyi Chen, Yilin Bai, Zhigang He, Chenlong Wang, Huichi Zhou, Yiqiang Li, Tianshuo Zhou, Yue Yu, Chujie Gao, Qihui Zhang, Yi Gui, Zhen Li, Yao Wan, Pan Zhou, Jianfeng Gao, Lichao Sun
GUI-WORLD is a new dataset designed to evaluate and enhance the capabilities of Multimodal Large Language Models (MLLMs) in understanding Graphical User Interface (GUI) content. It contains over 12,000 GUI videos covering six GUI scenarios and eight types of GUI-oriented questions in three formats (multiple-choice, free-form, and conversational). The dataset features meticulously crafted Human-MLLM annotations, including detailed captions, human-annotated keyframes, and diverse types of QA, providing a comprehensive set of questions and instructions for benchmarking and improving the general GUI-oriented capabilities of MLLMs (an illustrative sample layout is sketched after this summary).

Using GUI-WORLD, the paper evaluates current state-of-the-art MLLMs, including ImageLLMs and VideoLLMs, on various types of GUI content, with a particular focus on dynamic and sequential content. The benchmark encompasses seven mainstream MLLMs, three keyframe selection strategies, six GUI scenarios, and a diverse array of queries in multiple-choice, free-form, and conversational formats. The findings reveal that ImageLLMs struggle with dynamic GUI content when manually annotated keyframes or operation history are unavailable, while VideoLLMs fall short on all GUI-oriented tasks, reflecting the sparsity of GUI video data available for training. Most MLLMs struggle on GUI-WORLD, highlighting their limited dynamic understanding of graphical interfaces and the need for further enhancement. The experiments also underscore the importance of vision perception, the number of keyframes, and image resolution, and show that keyframe selection is critical for GUI-oriented tasks.

Building on GUI-WORLD, the paper takes an initial step toward leveraging a fine-tuned VideoLLM as a GUI agent and introduces GUI-Vid, a GUI-oriented VideoLLM with enhanced capabilities for handling varied and complex GUI tasks. GUI-Vid, fine-tuned on GUI-WORLD, significantly outperforms its baseline model, showing an average improvement of 30% across tasks and GUI scenarios, and achieves results comparable to top-performing models, though it still lags behind industry leaders such as GPT-4V and Gemini-Pro on certain tasks. Due to limitations in the performance of base LLMs, however, using VideoLLMs as GUI agents remains a significant challenge.

The paper also discusses the broader challenges of GUI understanding, including the need for MLLMs to process sequential information and dynamic operations, and notes that current research has largely been limited to Web-based environments. It highlights the potential of GUI-Vid in XR applications and the high quality of the dataset's annotations, and concludes that further research is needed.
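To make the dataset composition described above concrete, the following is a minimal sketch of what a single GUI-WORLD sample could look like: a GUI video, its scenario label, a detailed caption, human-annotated keyframes, and QA items in the three question formats. The field names, example values, and file paths are hypothetical illustrations, not the dataset's actual release schema.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical schema: names and values are illustrative only,
# not the actual GUI-WORLD release format.
@dataclass
class QAItem:
    qa_type: str                # one of the eight GUI-oriented question types
    fmt: str                    # "multiple-choice", "free-form", or "conversational"
    question: str
    options: List[str] = field(default_factory=list)  # empty for free-form questions
    answer: str = ""

@dataclass
class GUIWorldSample:
    video_path: str             # screen-recorded GUI clip
    scenario: str               # one of the six GUI scenarios (e.g. "website", "software")
    caption: str                # detailed description of the clip
    keyframes: List[float]      # human-annotated keyframe timestamps (seconds)
    qa_items: List[QAItem]

# Illustrative example record (all values made up for demonstration).
sample = GUIWorldSample(
    video_path="videos/website/0001.mp4",
    scenario="website",
    caption="The user opens the settings menu and toggles dark mode.",
    keyframes=[1.2, 3.8, 6.5],
    qa_items=[
        QAItem(
            qa_type="sequential-understanding",
            fmt="multiple-choice",
            question="What does the user do after opening the settings menu?",
            options=["Toggles dark mode", "Closes the tab", "Scrolls down", "Logs out"],
            answer="Toggles dark mode",
        )
    ],
)
print(sample.scenario, len(sample.qa_items))
```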
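Because the findings stress keyframe selection and the number of keyframes fed to the model, the sketch below shows one way frames might be sampled from a GUI video under different strategies before being passed to an ImageLLM. The strategy names ("uniform", "random", "human") and the commented-out `query_image_llm` call are assumptions for illustration; the paper's three strategies may be defined differently.

```python
import random
from typing import List, Optional

def select_keyframes(
    num_frames: int,
    strategy: str,
    k: int = 8,
    human_keyframes: Optional[List[int]] = None,
) -> List[int]:
    """Return k frame indices under a given selection strategy.

    The strategies here are illustrative stand-ins, not the paper's
    exact keyframe-selection procedures.
    """
    if strategy == "uniform":
        # Evenly spaced frames across the whole clip.
        step = max(num_frames // k, 1)
        return list(range(0, num_frames, step))[:k]
    if strategy == "random":
        return sorted(random.sample(range(num_frames), min(k, num_frames)))
    if strategy == "human":
        # Use human-annotated keyframes shipped with the dataset.
        assert human_keyframes is not None
        return human_keyframes[:k]
    raise ValueError(f"unknown strategy: {strategy}")

# Hypothetical usage: compare which frames each strategy would feed the model.
for strategy in ("uniform", "random", "human"):
    frames = select_keyframes(
        num_frames=300, strategy=strategy, k=8, human_keyframes=[12, 95, 160]
    )
    print(strategy, frames)
    # answer = query_image_llm(frames, question)  # hypothetical model call
```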