ANDROIDWORLD: A Dynamic Benchmarking Environment for Autonomous Agents

10 Jun 2024 | Christopher Rawles*, Sarah Clinckemaillie†, Yifan Chang†, Jonathan Waltz†, Gabrielle Lau†, Marybeth Fair†, Alice Li†, William Bishop†, Wei Li†, Folawiyọ Campbell-Ajala†, Daniel Toyama†, Robert Berry†, Divya Tyamagundlu†, Timothy Lillicrap†, and Oriana Riva†
ANDROIDWORLD is a dynamic benchmarking environment for autonomous agents that provides reward signals for 116 programmatic tasks across 20 real-world Android apps. Unlike existing interactive environments, which provide static test sets, ANDROIDWORLD dynamically constructs tasks that are parameterized and expressed in natural language in unlimited ways, enabling testing on a much larger and more realistic suite of tasks. Reward signals are derived from the computer's system state, making them durable across task variations and extensible across different apps. To demonstrate ANDROIDWORLD's benefits and mode of operation, we introduce a new computer control agent, M3A. M3A can complete 30.6% of ANDROIDWORLD's tasks, leaving ample room for future work. Furthermore, we adapt a popular desktop web agent to work on Android, which we find to be less effective on mobile, suggesting future research is needed to achieve universal, cross-domain agents. Finally, we conduct a robustness analysis by testing M3A against a range of task variations on a representative subset of tasks, demonstrating that variations in task parameters can significantly alter a task's complexity and, consequently, an agent's performance, highlighting the importance of testing agents under diverse conditions.

ANDROIDWORLD is a lightweight environment, requiring only 2 GB of memory and 8 GB of disk space, and is designed with convenience in mind. It connects agents to the Android OS by leveraging the Python library AndroidEnv to connect to the freely available Android Emulator. ANDROIDWORLD is highly extensible, making it easy to add new tasks and even new benchmarks, which we demonstrate by integrating the MiniWoB++ benchmark.

We make the following contributions: (i) the creation of a new, highly diverse, and realistic computer control agent environment; (ii) the establishment of benchmark performance with a state-of-the-art multimodal agent; and (iii) a careful analysis demonstrating the need to evaluate agents across multiple test sets and conditions due to the inherent stochasticity in both models and environments.
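To make the idea of parameterized tasks with state-derived rewards concrete, the sketch below shows one possible shape for such a task in Python. The class name, template, and adb-based success check are our own illustration under assumed conventions, not ANDROIDWORLD's actual task API: the instruction is rendered from a natural-language template with randomly drawn parameters, and success is judged by querying the device's contacts provider rather than by inspecting the agent's actions.

```python
# Illustrative sketch only (hypothetical names, not the ANDROIDWORLD API):
# a task whose instruction is generated from a template with random
# parameters, and whose reward is read from device state via adb.
import random
import subprocess


class CreateContactTask:
    """Hypothetical task: add a contact with a randomly drawn name and number."""

    template = "Create a new contact named {name} with the phone number {number}."

    @staticmethod
    def generate_random_params() -> dict:
        # Fresh parameters on every instantiation yield a new, unseen instruction.
        name = random.choice(["Alice Smith", "Bob Jones", "Carol Diaz"])
        number = "555" + "".join(random.choices("0123456789", k=7))
        return {"name": name, "number": number}

    def __init__(self, params: dict):
        self.params = params
        self.goal = self.template.format(**params)  # natural-language instruction

    def is_successful(self) -> bool:
        # Reward comes from system state: query the on-device contacts provider
        # instead of inspecting screenshots or the agent's action trace.
        result = subprocess.run(
            ["adb", "shell", "content", "query",
             "--uri", "content://com.android.contacts/data/phones"],
            capture_output=True, text=True, check=True,
        )
        stored_digits = "".join(ch for ch in result.stdout if ch.isdigit())
        return (self.params["name"] in result.stdout
                and self.params["number"] in stored_digits)
```

Because the check reads durable system state (the contacts database) rather than UI snapshots, the same verifier keeps working regardless of which parameters were drawn or how the agent navigated the app.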
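The connection between an agent and the emulator can be sketched roughly as follows. This is an outline only: the keyword arguments to android_env.load are assumptions based on AndroidEnv's published examples, the loader interface has changed across releases, and the AVD name and all paths are placeholders; consult the AndroidEnv and ANDROIDWORLD documentation for the exact setup on your version.

```python
# Rough sketch, not a verified recipe: the loader arguments below are assumed
# from AndroidEnv's examples and may differ on your installed version.
import android_env

env = android_env.load(
    avd_name="MyAvd",                                  # placeholder AVD name
    android_avd_home="/path/to/.android/avd",          # placeholder paths
    android_sdk_root="/path/to/Android/Sdk",
    emulator_path="/path/to/Android/Sdk/emulator/emulator",
    adb_path="/path/to/Android/Sdk/platform-tools/adb",
    task_path="/path/to/task.textproto",               # AndroidEnv task definition
)

# AndroidEnv implements the dm_env interface, so an agent interacts through
# reset()/step() using actions that conform to env.action_spec().
timestep = env.reset()
print(env.action_spec())
print(env.observation_spec())
env.close()
```

ANDROIDWORLD's task suite layers its natural-language goals and state-based reward checks on top of this connection, so an agent only needs to consume screen observations and emit UI actions.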