3 Jun 2024 | Junyang Wang, Haiyang Xu, Haitao Jia, Xi Zhang, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, Jitao Sang
Mobile-Agent-v2 is a multi-agent architecture designed to enhance navigation in mobile device operation tasks. The system consists of three agents: planning, decision, and reflection agents. The planning agent condenses lengthy, interleaved image-text history operations into a pure-text task progress, which is then passed to the decision agent. This reduces context length, making it easier for the decision agent to navigate task progress. A memory unit is designed to retain focus content from history screens, which is updated by the decision agent. The reflection agent observes the outcomes of each operation and handles any mistakes accordingly. Experimental results show that Mobile-Agent-v2 achieves over a 30% improvement in task completion compared to the single-agent architecture of Mobile-Agent. The code is open-sourced at https://github.com/X-PLUG/MobileAgent.
Mobile device operation tasks involve multi-step sequential processing: the operator must perform a series of continuous operations on the device, starting from the initial screen, until the instruction is fully executed. Two main challenges arise in this process: planning the operation intent and navigating focus content from history screens. Mobile-Agent-v2 addresses them with dedicated roles. The planning agent generates task progress based on history operations, enabling effective operation generation by the decision agent; a memory unit stores focus content from history screens and is updated by the decision agent; and the reflection agent assesses whether the decision agent's operation meets expectations, generating appropriate remedial measures when it does not.
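To make the division of labor concrete, here is a minimal sketch of the three-agent loop in Python. The `device` and `mllm` objects and their `screenshot`/`plan`/`decide`/`reflect` methods are hypothetical stand-ins, not the repository's actual API; see https://github.com/X-PLUG/MobileAgent for the real implementation.

```python
# Hypothetical sketch of the Mobile-Agent-v2 control loop.
def run_task(instruction, device, mllm, max_steps=30):
    history = []          # past (operation, reflection) pairs
    memory = ""           # focus content carried across screens
    reflection = None     # outcome of the previous operation

    for _ in range(max_steps):
        screen = device.screenshot()

        # Planning agent: condense the interleaved image-text history into
        # pure-text task progress, shrinking the decision agent's context.
        progress = mllm.plan(instruction, history)

        # Decision agent: choose the next operation from the task progress,
        # current screen, memory unit, and the last reflection result; it
        # also updates the memory with newly observed focus content.
        operation, memory = mllm.decide(
            instruction, progress, screen, memory, reflection)
        if operation == "STOP":
            return True
        device.execute(operation)

        # Reflection agent: compare the screens before and after the
        # operation to flag erroneous or ineffective actions, feeding the
        # result back into the next decision.
        reflection = mllm.reflect(screen, device.screenshot(), operation)
        history.append((operation, reflection))
    return False
```

The key design choice visible here is that only the planning agent ever sees the full history; the decision agent works from a short text summary plus the memory unit, which keeps its context small over long operation sequences.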
The system uses a visual perception module to enhance screen recognition capability. This module includes text recognition, icon recognition, and icon description tools. The planning agent summarizes history operations and tracks task progress. The decision agent generates operations based on current task progress, screen state, and reflection results. The reflection agent observes the screens before and after the decision agent's operation to determine if the operation meets expectations. If the operation is erroneous or ineffective, the reflection agent communicates the result to the decision agent, which rethinks and implements the correct operation.
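A rough sketch of how such a perception module could be composed is shown below. The `ocr`, `detect_icons`, and `describe_icon` callables are assumptions standing in for the paper's text recognition, icon recognition, and icon description tools; the actual models and interfaces may differ.

```python
# Hypothetical sketch of the visual perception module.
from dataclasses import dataclass

@dataclass
class ScreenElement:
    kind: str      # "text" or "icon"
    content: str   # recognized text or generated icon description
    box: tuple     # (x1, y1, x2, y2) pixel coordinates for tapping

def perceive(screenshot, ocr, detect_icons, describe_icon):
    """Turn a raw screenshot into grounded elements the agents can act on."""
    # Text recognition: each OCR hit becomes a tappable text element.
    elements = [ScreenElement("text", text, box)
                for text, box in ocr(screenshot)]
    # Icon recognition + description: crop each detected icon and caption
    # it, so the decision agent can refer to icons by description.
    for box in detect_icons(screenshot):
        crop = screenshot.crop(box)
        elements.append(ScreenElement("icon", describe_icon(crop), box))
    return elements
```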
Experiments show that Mobile-Agent-v2 achieves significant improvements in task success and completion rates compared to the single-agent architecture, and performs even better when manual operation knowledge is injected. The multi-agent architecture and memory unit play a crucial role in handling the long operation sequences of UI tasks. The system is evaluated across operating systems, language environments, and applications, demonstrating effectiveness in both English and non-English scenarios, and proves more capable on complex tasks and multi-app operations. Evaluations with different MLLMs show that GPT-4V combined with the agent architecture remains the most effective configuration for operational capabilities. Ablation studies indicate that the planning agent has the most significant impact on the overall framework, the reflection agent is essential for correcting erroneous operations, and the memory unit is crucial for successful execution in multi-app scenarios. Case studies further illustrate these capabilities in practice.