MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents

12 Jun 2024 | Luyuan Wang, Yongyu Deng, Yiwei Zha, Guodong Mao, Qinmin Wang, Tianchen Min, Wei Chen, Shoufa Chen
MobileAgentBench is an efficient and user-friendly benchmark for evaluating mobile large language model (LLM) agents. It is designed to simplify the evaluation process by reducing the need for extensive manual testing, and it comprises 100 tasks across 10 open-source apps, categorized by difficulty level. The benchmark is used to evaluate existing mobile agents, such as AppAgent and MobileAgent, providing a systematic comparison of their performance, and is accessible via its project webpage, contributing to advancements in both academic and industrial settings.

MobileAgentBench runs on real Android devices and supports both physical hardware and emulators. It automatically switches to the next task once the agent stops or exceeds the maximum number of steps, so no human supervision is required. Task success is determined by checking the final UI state rather than the action sequence, which is more reliable and efficient. The benchmark uses the Android Accessibility Service to capture app events and forward them to the benchmark server, ensuring accurate task evaluation (a minimal sketch of this event-forwarding pattern appears after this summary).

The 100 tasks span 10 everyday apps and are designed to simulate normal user activities at varying difficulty levels; tasks are categorized as easy, medium, or hard based on the minimum number of steps required to complete them. The benchmark is user-friendly, requires minimal code changes for integration, supports a wide range of testing tasks across various Android operating system versions, and executes on actual devices.

The paper evaluates several mobile LLM agents, including AndroidArena, AutoDroid, AppAgent, CogAgent, and MobileAgent. The results show that AppAgent achieves the highest success rate, benefiting from its self-exploration mechanism, while CogAgent has the lowest success rate, likely due to its naive agent implementation.

Six metrics are defined to comprehensively evaluate agent performance: success rate, step-wise efficiency, latency, tokens, false negative rate, and false positive rate (a sketch of how such metrics might be aggregated also follows).

MobileAgentBench addresses several limitations of existing benchmarks, including scalability, robustness, and the lack of a realistic environment. It provides a reliable and efficient way to evaluate mobile LLM agents, making evaluation accessible to developers and researchers outside the Android development community, and it is flexible and easy to extend to a broad spectrum of testing tasks. It also introduces a novel method for determining the task-terminating state, making the benchmark resistant to the complexity of tracking multiple potential success pathways and ensuring reliable and precise benchmarking outcomes.
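The event-forwarding mechanism described above can be pictured as a small Accessibility Service running on the device alongside the app under test. The Kotlin sketch below is illustrative only: the class name, server endpoint, and JSON payload are assumptions rather than the benchmark's actual implementation, but it shows the general pattern of capturing UI events and posting them to a benchmark server that later judges success from the final UI state.

```kotlin
// Hypothetical companion service; it would still need to be declared in the
// app manifest and enabled in the device's accessibility settings.
import android.accessibilityservice.AccessibilityService
import android.view.accessibility.AccessibilityEvent
import java.net.HttpURLConnection
import java.net.URL
import kotlin.concurrent.thread

class EventForwardingService : AccessibilityService() {

    // Assumed address of the benchmark server on the host machine
    // (10.0.2.2 is the emulator's alias for the host's localhost).
    private val serverUrl = "http://10.0.2.2:8080/event"

    override fun onAccessibilityEvent(event: AccessibilityEvent?) {
        val e = event ?: return
        // Forward only high-level UI transitions; the server relies on the
        // final UI state, not the full action sequence, to decide success.
        if (e.eventType != AccessibilityEvent.TYPE_WINDOW_STATE_CHANGED &&
            e.eventType != AccessibilityEvent.TYPE_VIEW_CLICKED
        ) return

        val payload = """{"package":"${e.packageName}",""" +
            """"type":${e.eventType},"class":"${e.className}"}"""

        // Fire-and-forget POST so the UI thread is never blocked.
        thread {
            try {
                val conn = URL(serverUrl).openConnection() as HttpURLConnection
                conn.requestMethod = "POST"
                conn.doOutput = true
                conn.setRequestProperty("Content-Type", "application/json")
                conn.outputStream.use { it.write(payload.toByteArray()) }
                conn.inputStream.close()
            } catch (_: Exception) {
                // A dropped event is tolerable; the final-state check is what matters.
            }
        }
    }

    override fun onInterrupt() {}
}
```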
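The six reported metrics can be aggregated from per-task records once the benchmark's verdict for each task is compared against human-annotated ground truth. The following sketch uses assumed field names and formulas inferred from the metric names; the paper's exact definitions may differ.

```kotlin
// Hypothetical per-task record and metric aggregation.
data class TaskRecord(
    val judgedSuccess: Boolean,   // benchmark's final-UI-state verdict
    val actualSuccess: Boolean,   // human-annotated ground truth
    val stepsTaken: Int,
    val minSteps: Int,            // minimum steps needed for the task
    val latencySeconds: Double,   // average wall-clock time per step
    val tokensUsed: Int           // LLM tokens consumed for the task
)

data class Metrics(
    val successRate: Double,
    val stepwiseEfficiency: Double,
    val avgLatency: Double,
    val avgTokens: Double,
    val falseNegativeRate: Double,
    val falsePositiveRate: Double
)

fun aggregate(records: List<TaskRecord>): Metrics {
    val succeeded = records.filter { it.actualSuccess }
    val failed = records.filter { !it.actualSuccess }
    return Metrics(
        successRate = succeeded.size.toDouble() / records.size,
        // Ratio of steps taken to the minimum required, over successful tasks.
        stepwiseEfficiency = succeeded
            .map { it.stepsTaken.toDouble() / it.minSteps }
            .average(),
        avgLatency = records.map { it.latencySeconds }.average(),
        avgTokens = records.map { it.tokensUsed.toDouble() }.average(),
        // Task actually succeeded but was judged a failure.
        falseNegativeRate = succeeded.count { !it.judgedSuccess }.toDouble() /
            succeeded.size.coerceAtLeast(1),
        // Task actually failed but was judged a success.
        falsePositiveRate = failed.count { it.judgedSuccess }.toDouble() /
            failed.size.coerceAtLeast(1)
    )
}
```

Read this way, the false negative rate measures how often the final-UI-state check misses a genuine success, and the false positive rate how often it credits a failure; low values on both are what make the fully automated judging trustworthy.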