Android in the Zoo: Chain-of-Action-Thought for GUI Agents

13 Jul 2024 | Jiwen Zhang, Jihao Wu, Yihua Teng, Minhui Liao, Nuo Xu, Xiao Xiao, Zhongyu Wei, Duyu Tang
This paper introduces Chain-of-Action-Thought (CoAT), a novel prompting paradigm for GUI agents that explicitly captures the underlying semantics of navigation actions. CoAT lets a GUI agent perceive, think, and decide in an interleaved manner by incorporating four semantic components at each step: a screen description, action thinking, a next-action description, and an action result. The authors also propose the Android-In-The-Zoo (AITZ) dataset, which contains 18,643 screen-action pairs annotated with all four types of semantic information, spanning over 70 Android apps; they describe it as the first and largest fine-grained dataset for Android GUI navigation. The dataset is constructed by using GPT-4V and state-of-the-art icon detection models to generate candidate screen descriptions, action thoughts, and next-action descriptions, which are then validated and refined by human annotators.

Experiments show that fine-tuning a 1B model on AITZ achieves performance comparable to CogAgent-Chat-18B. The authors also compare CoAT against three typical prompting methods for GUI tasks (Standard, Chain-of-Action, and Chain-of-Thought) and demonstrate that CoAT significantly improves action prediction, helping agents adapt to GUI tasks better and faster. The paper closes with a discussion of the method's limitations and the dataset's ethical considerations.
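To make the interleaved perceive-think-decide loop concrete, here is a minimal Python sketch of how a CoAT-style prompt might be assembled for one navigation step. The names (`CoATStep`, `build_coat_prompt`) and the template wording are illustrative assumptions, not the paper's exact format; only the four semantic components come from the summary above.

```python
# Minimal sketch of assembling a CoAT-style prompt for one navigation step.
# Field names and template wording are hypothetical, not the paper's exact
# templates; the four semantic components mirror those described above.

from dataclasses import dataclass

@dataclass
class CoATStep:
    screen_description: str  # what the current screen shows
    action_think: str        # reasoning about what to do next
    next_action: str         # description of the chosen action
    action_result: str       # observed outcome after acting (filled post hoc)

def build_coat_prompt(goal: str, history: list[CoATStep],
                      screen_description: str) -> str:
    """Interleave perception, thought, and action history into one prompt."""
    lines = [f"Goal: {goal}", ""]
    for i, step in enumerate(history, 1):
        lines += [
            f"Step {i} screen: {step.screen_description}",
            f"Step {i} think: {step.action_think}",
            f"Step {i} action: {step.next_action}",
            f"Step {i} result: {step.action_result}",
            "",
        ]
    lines += [
        f"Current screen: {screen_description}",
        "Think about what to do next, then describe the next action.",
    ]
    return "\n".join(lines)

# Example usage with hypothetical content:
history = [
    CoATStep(
        screen_description="Android home screen with a Settings icon.",
        action_think="To change Wi-Fi, I should open Settings first.",
        next_action="Tap the Settings icon.",
        action_result="The Settings app opened.",
    )
]
print(build_coat_prompt(
    "Turn on Wi-Fi", history,
    "Settings main page listing Network & internet.",
))
```

The point of the interleaving is that the model conditions each decision on both its own prior reasoning and the observed outcomes, rather than on a bare action history.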