ANDROIDWORLD: A Dynamic Benchmarking Environment for Autonomous Agents

10 Jun 2024 | Christopher Rawles*, Sarah Clinckemaillie†, Yifan Chang†, Jonathan Waltz†, Gabrielle Lau†, Marybeth Fair†, Alice Li†, William Bishop†, Wei Li†, Folawiyọ Campbell-Ajala†, Daniel Toyama†, Robert Berry†, Divya Tyamagundlu†, Timothy Lillicrap†, and Oriana Riva†
ANDROIDWORLD is a dynamic benchmarking environment for autonomous agents that provides reward signals for 116 programmatic tasks across 20 real-world Android apps. Unlike existing interactive environments, which provide static test sets, ANDROIDWORLD dynamically constructs tasks that are parameterized and expressed in natural language in unlimited ways, enabling testing on a much larger and more realistic suite of tasks. Reward signals are derived from the computer's system state, making them durable across task variations and extensible across different apps. To demonstrate ANDROIDWORLD's benefits and mode of operation, we introduce a new computer control agent, M3A. M3A can complete 30.6% of ANDROIDWORLD's tasks, leaving ample room for future work. Furthermore, we adapt a popular desktop web agent to work on Android, which we find to be less effective on mobile, suggesting future research is needed to achieve universal, cross-domain agents. Finally, we conduct a robustness analysis by testing M3A against a range of task variations on a representative subset of tasks, demonstrating that variations in task parameters can significantly alter a task's complexity and, consequently, an agent's performance, highlighting the importance of testing agents under diverse conditions.

ANDROIDWORLD is a lightweight environment, requiring only 2 GB of memory and 8 GB of disk space, and is designed with convenience in mind. It connects agents to the Android OS by leveraging the Python library AndroidEnv to connect to the freely available Android Emulator. ANDROIDWORLD is highly extensible, making it easy to add new tasks and even new benchmarks, which we demonstrate by integrating the MiniWoB++ benchmark.

We make the following contributions: (i) the creation of a new, highly diverse, and realistic computer control agent environment; (ii) the establishment of benchmark performance with a state-of-the-art multimodal agent; and (iii) a careful analysis demonstrating the need to evaluate agents across multiple test sets and conditions due to the inherent stochasticity in both models and environments.
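To make the idea of parameterized tasks with state-derived rewards concrete, the sketch below shows one possible shape for such a task in Python. The class name, template, and adb-based success check are our own illustration under assumed conventions, not ANDROIDWORLD's actual task API: the instruction is rendered from a natural-language template with randomly drawn parameters, and success is judged by querying the device's contacts provider rather than by inspecting the agent's actions.

```python
# Illustrative sketch only (hypothetical names, not the ANDROIDWORLD API):
# a task whose instruction is generated from a template with random
# parameters, and whose reward is read from device state via adb.
import random
import subprocess


class CreateContactTask:
    """Hypothetical task: add a contact with a randomly drawn name and number."""

    template = "Create a new contact named {name} with the phone number {number}."

    @staticmethod
    def generate_random_params() -> dict:
        # Fresh parameters on every instantiation yield a new, unseen instruction.
        name = random.choice(["Alice Smith", "Bob Jones", "Carol Diaz"])
        number = "555" + "".join(random.choices("0123456789", k=7))
        return {"name": name, "number": number}

    def __init__(self, params: dict):
        self.params = params
        self.goal = self.template.format(**params)  # natural-language instruction

    def is_successful(self) -> bool:
        # Reward comes from system state: query the on-device contacts provider
        # instead of inspecting screenshots or the agent's action trace.
        result = subprocess.run(
            ["adb", "shell", "content", "query",
             "--uri", "content://com.android.contacts/data/phones"],
            capture_output=True, text=True, check=True,
        )
        stored_digits = "".join(ch for ch in result.stdout if ch.isdigit())
        return (self.params["name"] in result.stdout
                and self.params["number"] in stored_digits)
```

Because the check reads durable system state (the contacts database) rather than UI snapshots, the same verifier keeps working regardless of which parameters were drawn or how the agent navigated the app.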
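The connection between an agent and the emulator can be sketched roughly as follows. This is an outline only: the keyword arguments to android_env.load are assumptions based on AndroidEnv's published examples, the loader interface has changed across releases, and the AVD name and all paths are placeholders; consult the AndroidEnv and ANDROIDWORLD documentation for the exact setup on your version.

```python
# Rough sketch, not a verified recipe: the loader arguments below are assumed
# from AndroidEnv's examples and may differ on your installed version.
import android_env

env = android_env.load(
    avd_name="MyAvd",                                  # placeholder AVD name
    android_avd_home="/path/to/.android/avd",          # placeholder paths
    android_sdk_root="/path/to/Android/Sdk",
    emulator_path="/path/to/Android/Sdk/emulator/emulator",
    adb_path="/path/to/Android/Sdk/platform-tools/adb",
    task_path="/path/to/task.textproto",               # AndroidEnv task definition
)

# AndroidEnv implements the dm_env interface, so an agent interacts through
# reset()/step() using actions that conform to env.action_spec().
timestep = env.reset()
print(env.action_spec())
print(env.observation_spec())
env.close()
```

ANDROIDWORLD's task suite layers its natural-language goals and state-based reward checks on top of this connection, so an agent only needs to consume screen observations and emit UI actions.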