Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration

3 Jun 2024 | Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, Jitao Sang
The paper introduces Mobile-Agent-v2, a multi-agent architecture that improves navigation in mobile device operation tasks. It comprises three agents: a planning agent, a decision agent, and a reflection agent. The planning agent condenses the lengthy, interleaved image-text operation history into a pure-text task progress, which shortens the context and makes decisions easier. The decision agent uses that task progress to choose the next operation and updates a memory unit with focus content. The reflection agent observes the outcome of each operation and corrects erroneous ones. In the reported experiments, Mobile-Agent-v2 achieves over 30% improvement in task completion compared with the single-agent Mobile-Agent, supporting the effectiveness of the multi-agent approach in handling long sequences of interleaved text and images. The code is open-sourced at https://github.com/X-PLUG/MobileAgent.
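The division of labor among the three agents can be illustrated with a minimal control-loop sketch. This is not the authors' implementation: the function names (`run_task`, `plan`, `decide`, `reflect`), the `AgentState` fields, the retry-on-failure policy, and the `ToyDevice` stand-in are all assumptions made for illustration. In the actual system each agent would be backed by a multimodal model and the screen would be a screenshot, not a string.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class AgentState:
    instruction: str
    progress: str = ""                               # pure-text task progress
    memory: List[str] = field(default_factory=list)  # focus content
    rejected: int = 0                                # operations flagged by reflection


def run_task(
    instruction: str,
    observe: Callable[[], str],
    act: Callable[[str], None],
    plan: Callable[[str, Optional[str]], str],
    decide: Callable[[str, str, List[str], str], Tuple[str, Optional[str]]],
    reflect: Callable[[str, str, str], bool],
    max_steps: int = 10,
) -> AgentState:
    state = AgentState(instruction)
    last_action: Optional[str] = None
    for _ in range(max_steps):
        # Planning: fold the last accepted operation into the text-only progress.
        state.progress = plan(state.progress, last_action)
        before = observe()
        # Decision: choose an operation and optionally record focus content.
        action, focus = decide(instruction, state.progress, state.memory, before)
        if action == "stop":
            break
        if focus is not None:
            state.memory.append(focus)
        act(action)
        after = observe()
        # Reflection: compare the screens before and after the operation.
        if reflect(before, after, action):
            last_action = action
        else:
            state.rejected += 1
            last_action = None  # keep the failed operation out of the progress
    return state


class ToyDevice:
    """Hypothetical stand-in for a phone: the 'screen' is a page number."""

    def __init__(self) -> None:
        self.page = 0
        self.taps = 0

    def observe(self) -> str:
        return f"page {self.page}"

    def act(self, action: str) -> None:
        self.taps += 1
        # The second tap is dropped to simulate an ineffective operation.
        if action == "tap_next" and self.taps != 2:
            self.page += 1


def toy_plan(progress: str, last_action: Optional[str]) -> str:
    return progress if last_action is None else progress + f"{last_action}; "


def toy_decide(instruction, progress, memory, screen):
    if screen == "page 3":
        return "stop", None
    return "tap_next", f"saw {screen}"


def toy_reflect(before: str, after: str, action: str) -> bool:
    # An operation that leaves the screen unchanged is treated as a failure.
    return before != after


device = ToyDevice()
final = run_task("go to page 3", device.observe, device.act,
                 toy_plan, toy_decide, toy_reflect)
print(final.progress, final.rejected, device.page)
```

In this toy run the dropped tap is caught by the reflection step and left out of the task progress, so the planning text records only operations that took effect. The decision step sees that short text and the memory list in place of the full screen history.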