MOBILE-AGENT: AUTONOMOUS MULTI-MODAL MOBILE DEVICE AGENT WITH VISUAL PERCEPTION

18 Apr 2024 | Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, Jitao Sang
This paper introduces Mobile-Agent, an autonomous multi-modal mobile device agent that uses visual perception tools to identify and locate visual and textual elements within app interfaces. Unlike previous solutions that rely on XML files or system metadata, Mobile-Agent navigates and performs complex operations in mobile apps using screenshots alone. The agent plans and decomposes tasks based on the visual context, enabling it to operate independently across diverse mobile environments, and its self-planning and self-reflection capabilities let it detect and correct invalid operations so that tasks are completed successfully.

To evaluate Mobile-Agent, the authors introduce Mobile-Eval, a benchmark covering 10 commonly used apps with instructions of varying difficulty. Experiments show that Mobile-Agent achieves high accuracy and completion rates, even on challenging instructions that span multiple apps.

The contributions of this work include:

- Proposing Mobile-Agent, an autonomous mobile device agent with visual perception.
- Introducing Mobile-Eval, a benchmark for evaluating mobile device agents.
- Conducting comprehensive experiments to demonstrate Mobile-Agent's effectiveness and efficiency.

The paper also reviews related work on LLM-based agents and agents for mobile devices, highlighting the advances and open challenges in these areas. Overall, Mobile-Agent represents a significant step toward versatile and adaptable mobile device assistants.
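To make the workflow concrete, the sketch below shows one way a screenshot-only perceive-plan-act-reflect loop like the one described above could be structured. It is a minimal illustration, not the authors' implementation: every function and class name here (`capture_and_perceive`, `plan_next_action`, `screens_unchanged`, etc.) is a hypothetical stand-in for the paper's components (a multi-modal LLM planner and visual perception tools such as OCR and icon detection), and the stubs would need to be filled in with real device and model calls.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the paper's components: a multi-modal LLM planner
# and visual perception tools (OCR plus icon detection) that ground actions on
# raw screenshots. All names are illustrative, not the authors' actual API.

@dataclass
class Observation:
    screenshot: bytes        # raw screen capture: the agent's only input
    detected_text: list      # OCR results as (string, bounding box) pairs
    detected_icons: list     # icon-detection results as (label, bounding box) pairs

@dataclass
class Action:
    kind: str                # e.g. "tap", "type", "swipe", "back", "stop"
    argument: str = ""       # text to type, or a description of the target element

def capture_and_perceive(device) -> Observation:
    """Take a screenshot and run OCR / icon detection on it (stub)."""
    raise NotImplementedError

def plan_next_action(llm, instruction, history, obs: Observation) -> Action:
    """Ask the multi-modal LLM for the next operation, given the user
    instruction, the operation history, and the perceived screen (stub)."""
    raise NotImplementedError

def execute(device, action: Action) -> None:
    """Translate the chosen action into a concrete device operation (stub)."""
    raise NotImplementedError

def screens_unchanged(before: Observation, after: Observation) -> bool:
    """Heuristic check that an operation had no visible effect, which is the
    signal that self-reflection should discard it and replan (stub)."""
    raise NotImplementedError

def run_agent(device, llm, instruction: str, max_steps: int = 30):
    """Screenshot-only agent loop: perceive, plan, act, then self-reflect."""
    history = []
    obs = capture_and_perceive(device)
    for _ in range(max_steps):
        action = plan_next_action(llm, instruction, history, obs)
        if action.kind == "stop":          # the planner judges the task complete
            break
        execute(device, action)
        new_obs = capture_and_perceive(device)
        if screens_unchanged(obs, new_obs):
            # Self-reflection: record the operation as invalid so the planner
            # avoids repeating it, then replan from the same screen.
            history.append((action, "invalid"))
        else:
            history.append((action, "ok"))
            obs = new_obs
    return history
```

The key design point this sketch tries to capture is that the loop never consults XML or system metadata: perception, planning, and the validity check after each step all operate on what is visible in the screenshot.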