Understanding the Weakness of Large Language Model Agents within a Complex Android Environment

9 Feb 2024 | Mingzhe Xing, Rongkai Zhang, Hui Xue, Qi Chen, Fan Yang, Zhen Xiao
The paper "Understanding the Weakness of Large Language Model Agents within a Complex Android Environment" by Mingzhe Xing et al. addresses the challenges faced by large language model (LLM) agents when executing tasks within complex software systems, particularly operating systems. The authors introduce AndroidArena, an environment for evaluating LLM agents on a modern operating system, featuring a scalable, semi-automated benchmark. The environment supports real-time internet data exchange and dynamic app management, enabling cross-app collaboration and constrained task execution.

Key challenges for LLM agents in such environments include:

1. **Dynamic Action Space**: The action space is vast and ever-changing, requiring agents to maintain an up-to-date understanding of the system.
2. **Cross-App Cooperation**: Complex tasks require cooperation across applications, demanding foresighted planning.
3. **User Constraints**: Agents must identify optimal solutions while respecting security concerns and user preferences.

AndroidArena addresses these challenges with a comprehensive benchmark containing annotated ground-truth action sequences. The authors propose adaptive metrics that evaluate task completion accurately despite the fact that valid solutions are not unique. Their findings reveal that even state-of-the-art LLM agents struggle in cross-app scenarios and in adhering to specific constraints. The study identifies deficiencies in four key capabilities (understanding, reasoning, exploration, and reflection) as the primary reasons for agent failures. The paper also presents an empirical analysis of why reflection fails and proposes an exploration strategy that improves success rates by 27%. The work is the first to provide fine-grained insight into these weaknesses of LLM agents and offers a path for future research.
The environment, benchmark, and evaluation code are released to facilitate further research and development.
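To illustrate the idea of an adaptive metric over annotated ground-truth action sequences, the sketch below scores a predicted action sequence against several valid reference sequences and keeps the best match. This is a minimal illustration using a longest-common-subsequence overlap, an assumption for clarity; the action-name format and the exact scoring formula are hypothetical, not the paper's definition.

```python
def lcs_len(pred, gold):
    # Classic dynamic-programming longest common subsequence over action lists.
    m, n = len(pred), len(gold)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if pred[i - 1] == gold[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def task_completion(pred, gold_variants):
    # Score the agent's trace against the best-matching annotated ground-truth
    # sequence, so tasks with several valid solutions are not penalised.
    return max(lcs_len(pred, g) / len(g) for g in gold_variants)

# Hypothetical action traces for a "send an email" task:
pred = ["open:mail", "tap:compose", "type:body", "tap:send"]
gold = [["open:mail", "tap:compose", "type:body", "tap:send"],
        ["open:mail", "tap:compose", "attach:file", "type:body", "tap:send"]]
print(task_completion(pred, gold))  # → 1.0 (exact match with the first variant)
```

Taking the maximum over reference variants is one simple way to handle non-unique solutions; the actual AndroidArena metrics may weight or decompose sequences differently.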