On the Effects of Data Scale on Computer Control Agents

25 Aug 2024 | Wei Li, William Bishop, Alice Li, Chris Rawles, Folawiyo Campbell-Ajala, Divya Tyamagundlu, Oriana Riva
This paper investigates the effectiveness of fine-tuning large language models (LLMs) to build computer control agents that can perform everyday human tasks on Android devices. The authors introduce ANDROIDCONTROL, a dataset of 15,283 demonstrations of everyday tasks in Android apps, each annotated with both high-level and low-level human-generated instructions. Covering 15,283 unique tasks across 833 Android apps, it is the most diverse computer control dataset to date.

The study aims to answer two key questions: (1) how much data is needed for fine-tuned models to reach a given performance level, and (2) what level of task complexity fine-tuning can effectively handle. The results show that fine-tuned models outperform zero-shot and few-shot baselines in domain, but require significantly more data to achieve robust performance on out-of-domain tasks, especially high-level ones. By extrapolating fitted scaling curves, the authors predict that roughly 1M episodes are needed for 95% accuracy on in-domain low-level tasks, and 2M episodes for a 95% episode completion rate on 5-step high-level tasks. Out of domain, the estimates rise to 10M and 150M episodes for low-level and high-level tasks, respectively.

The paper also discusses the study's limitations, including its reliance on a single model (PaLM-2S) and the possibility that tasks can be completed via alternative routes. The authors conclude that while fine-tuning may be a viable approach for achieving high in-domain performance, it may not scale well out of domain and may not be sufficient for robust performance on high-level tasks.
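The episode-count predictions above come from extrapolating fitted scaling curves. As a rough illustration only (not the authors' exact fitting procedure, and with entirely hypothetical data points), the sketch below fits a saturating power law of the form acc(n) = 1 - a·n^(-b) to accuracy measurements at increasing training-set sizes and solves for the n at which the curve reaches a 95% target:

```python
# Illustrative sketch: fit a saturating power law acc(n) = 1 - a * n**(-b)
# to (dataset size, accuracy) points and extrapolate to a target accuracy.
# The data points below are hypothetical, not taken from the paper.
import numpy as np
from scipy.optimize import curve_fit

def scaling_curve(n, a, b):
    """Accuracy as a function of the number of training episodes n."""
    return 1.0 - a * n ** (-b)

# Hypothetical measurements: training episodes vs. step accuracy.
episodes = np.array([1e3, 5e3, 1e4, 5e4, 1e5])
accuracy = np.array([0.55, 0.66, 0.71, 0.80, 0.84])

(a, b), _ = curve_fit(scaling_curve, episodes, accuracy, p0=[1.0, 0.1])

# Solve 1 - a * n**(-b) = target for n, i.e. n = (a / (1 - target))**(1 / b).
target = 0.95
episodes_needed = (a / (1.0 - target)) ** (1.0 / b)
print(f"Fitted a={a:.3f}, b={b:.3f}")
print(f"Estimated episodes for {target:.0%} accuracy: {episodes_needed:.2e}")
```

The key takeaway this mirrors is that with a slowly decaying exponent b, closing the last few accuracy points requires orders of magnitude more data, which is why the out-of-domain estimates balloon to tens of millions of episodes.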