Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception

18 Apr 2024 | Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, Jitao Sang
Mobile-Agent is an autonomous multi-modal mobile device agent that uses visual perception tools to identify and locate both visual and textual elements within an app's front-end interface. It autonomously plans and decomposes complex operation tasks and navigates mobile apps step by step. Unlike previous solutions that rely on XML files or mobile system metadata, Mobile-Agent takes a vision-centric approach, which gives it greater adaptability across diverse mobile operating environments and eliminates the need for system-specific customizations. To assess its performance, the authors introduce Mobile-Eval, a benchmark for evaluating mobile device operations. On Mobile-Eval, Mobile-Agent achieves high accuracy and completion rates, and it can complete even challenging instructions such as multi-app operations. The code and model are open-sourced at https://github.com/X-PLUG/MobileAgent.

Mobile-Agent is built on GPT-4V, a state-of-the-art multi-modal large language model (MLLM), and adds visual perception modules for operation localization: text detection and OCR models that describe screen content and identify and locate on-screen text.
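To make the localization step concrete, here is a minimal sketch of how OCR output can be turned into a tap coordinate. It uses the off-the-shelf easyocr library purely for illustration; the actual detection and OCR models, thresholds, and matching logic in Mobile-Agent may differ.

```python
# Illustrative OCR-based localization: map a piece of on-screen text to a tap point.
# This is a minimal sketch using the off-the-shelf easyocr library, not the exact
# detection/OCR models shipped with Mobile-Agent.
import easyocr

def locate_text(screenshot_path: str, target: str, min_conf: float = 0.5):
    """Return the (x, y) center of the first OCR box whose text contains `target`."""
    reader = easyocr.Reader(["en"])                 # load an English text recognizer
    for bbox, text, conf in reader.readtext(screenshot_path):
        if conf >= min_conf and target.lower() in text.lower():
            xs = [point[0] for point in bbox]       # bbox is four corner points
            ys = [point[1] for point in bbox]
            return (sum(xs) / len(xs), sum(ys) / len(ys))
    return None                                     # target text not found on screen

# e.g. locate_text("screen.png", "Sign in") returns the center of the "Sign in" button
# if it is visible on the screenshot, otherwise None.
```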
Through carefully crafted prompts, Mobile-Agent interacts with these tools to automate mobile device operations. It plans each task on its own from the current screenshot, the user instruction, and the operation history, and its self-reflection capability lets it identify and correct erroneous operations (a simplified version of this loop is sketched below).

Mobile-Eval covers 10 commonly used apps with instructions of varying difficulty. The experiments show that Mobile-Agent achieves high completion rates and operation accuracy, and that it handles complex instructions, such as operating multiple apps, effectively. The contributions are the proposal of Mobile-Agent, the introduction of the Mobile-Eval benchmark, and a comprehensive analysis of Mobile-Agent on Mobile-Eval. Mobile-Agent performs strongly across instruction comprehension, self-reflection, and multi-app operations, making it a versatile and adaptable solution for interacting with mobile applications without relying on system-specific metadata.
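The loop described above (plan an action from the screenshot, instruction, and history; execute it; reflect on the result) can be sketched roughly as follows. The `decide_next_action` and `action_succeeded` functions are hypothetical stand-ins for GPT-4V prompt calls, and the action set is reduced to taps and back presses; the adb commands are standard Android tooling rather than Mobile-Agent's own operation API.

```python
# Simplified screenshot -> plan -> act -> reflect loop. The decide/reflect functions
# stand in for GPT-4V prompt calls and are hypothetical placeholders; the adb commands
# are standard Android tooling, not a Mobile-Agent-specific API.
import subprocess

def take_screenshot(path: str = "screen.png") -> str:
    """Capture the device screen over adb and save it as a PNG."""
    png = subprocess.run(["adb", "exec-out", "screencap", "-p"],
                         check=True, capture_output=True).stdout
    with open(path, "wb") as f:
        f.write(png)
    return path

def tap(x: int, y: int) -> None:
    """Send a tap event to the device."""
    subprocess.run(["adb", "shell", "input", "tap", str(x), str(y)], check=True)

def decide_next_action(instruction: str, screenshot: str, history: list) -> dict:
    """Placeholder for an MLLM call that returns e.g. {"op": "tap", "x": 540, "y": 960},
    {"op": "back"}, or {"op": "stop"} given the screenshot, instruction, and history."""
    raise NotImplementedError("wire this to a multimodal model such as GPT-4V")

def action_succeeded(instruction: str, before: str, after: str, action: dict) -> bool:
    """Placeholder self-reflection call: did the screen change as intended?"""
    raise NotImplementedError("another MLLM prompt comparing the two screenshots")

def run_agent(instruction: str, max_steps: int = 20) -> None:
    history: list = []
    for _ in range(max_steps):
        before = take_screenshot("before.png")
        action = decide_next_action(instruction, before, history)
        if action["op"] == "stop":                  # the model judges the task complete
            return
        if action["op"] == "tap":
            tap(action["x"], action["y"])
        elif action["op"] == "back":
            subprocess.run(["adb", "shell", "input", "keyevent", "KEYCODE_BACK"],
                           check=True)
        after = take_screenshot("after.png")
        # Self-reflection: record whether the operation had the intended effect so the
        # next planning step can correct course if it did not.
        history.append({"action": action,
                        "ok": action_succeeded(instruction, before, after, action)})
```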