This paper introduces Chain-of-Action-Thought (CoAT), a novel paradigm for enhancing the navigation ability of large language models (LLMs) in graphical user interface (GUI) agents. CoAT explicitly captures the semantics underlying each navigation action through four components: screen context, action thinking, action description, and action result. The authors construct the Android-In-The-Zoo (AITZ) dataset, the first and largest fine-grained dataset for Android GUI navigation, containing 18,643 screen-action pairs with detailed semantic annotations. Experiments show that CoAT significantly improves action prediction accuracy over standard context modeling methods, and that fine-tuning a 1B model on AITZ achieves performance comparable to CogAgent-Chat-18B. The paper also discusses the limitations and ethical considerations of the proposed approach.
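
To make the four CoAT components concrete, here is a minimal Python sketch of what a single annotated screen-action step might look like as a data record. The class name, field names, and example values are illustrative assumptions for exposition, not the actual AITZ schema or the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class CoATStep:
    """One screen-action pair annotated with the four CoAT semantic components.

    Field names are hypothetical; the real AITZ annotation format may differ.
    """
    screen_context: str      # what is currently visible on the screen
    action_thinking: str     # reasoning about which action to take next
    action_description: str  # the concrete action the agent performs
    action_result: str       # the observed outcome after the action


# Illustrative example (invented content, not drawn from the dataset):
step = CoATStep(
    screen_context="Home screen showing a search bar and a grid of app icons.",
    action_thinking="To check the weather, I should tap the search bar and "
                    "type a weather query.",
    action_description="Tap the search bar at the top of the screen.",
    action_result="The keyboard opens and the search bar gains focus.",
)
print(step.action_description)
```

A record like this makes explicit not just the action taken but the reasoning before it and the outcome after it, which is the kind of intermediate semantics the paper argues standard context modeling leaves implicit.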