On the Effects of Data Scale on Computer Control Agents


25 Aug 2024 | Wei Li, William Bishop, Alice Li, Chris Rawles, Folawiyọ Campbell-Ajala, Divya Tyamagundlu, and Oriana Riva
This paper investigates how data scale affects the performance of computer control agents, particularly agents built on large language models (LLMs). The authors introduce ANDROIDCONTROL, a new dataset of 15,283 demonstrations of everyday tasks performed in Android apps. Each episode pairs a high-level task instruction with human-generated low-level instructions for the individual steps, allowing analysis of how task complexity affects model performance both in and out of domain. The dataset is also diverse, spanning 15,283 unique task instructions across 833 Android apps, which enables in-depth analysis of performance across domains.
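To make the two-level structure concrete, here is a minimal sketch of what one such episode record might look like in Python. The field names and types are illustrative assumptions only, not ANDROIDCONTROL's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """One low-level step in a demonstration (hypothetical schema)."""
    low_level_instruction: str  # e.g. "Tap the search icon"
    action: dict                # recorded UI action, e.g. {"type": "tap", "x": 540, "y": 1200}
    screenshot_path: str        # observation captured before the action
    ui_tree: str                # serialized accessibility tree for the screen

@dataclass
class Episode:
    """One task demonstration pairing a high-level goal with its low-level steps."""
    app: str                     # e.g. "com.example.notes"
    high_level_instruction: str  # e.g. "Create a note titled 'Groceries'"
    steps: list[Step] = field(default_factory=list)

# An agent evaluated with low-level instructions predicts one action per Step;
# with only the high-level instruction it must plan the whole sequence itself.
```

This illustrates why the two instruction levels probe different capabilities: low-level evaluation tests single-step action grounding, while high-level evaluation additionally tests planning.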
The study finds that fine-tuned models outperform zero-shot and few-shot baselines, with performance improving as more training data is collected. Out-of-domain performance scales more slowly, however, suggesting that fine-tuning on additional data alone may not yield robust out-of-domain behavior, especially on high-level tasks. Low-level tasks are easier to handle and require less data for robust performance than high-level tasks.

The paper evaluates several LLMs, including PaLM-2L, PaLM-2S, Gemini 1.5 Pro, GPT-4, and GPT-4 Turbo, on both low- and high-level tasks. Fine-tuned models achieve higher accuracy on both, with the best fine-tuned model reaching 71.5% on high-level tasks and 86.6% on low-level tasks. Robust out-of-domain performance, by contrast, is estimated to require far more data: roughly 10M and 150M episodes to reach 95% accuracy on low- and high-level tasks, respectively.

These results highlight the importance of data scale for building robust computer control agents, particularly for high-level tasks. The authors conclude that fine-tuning is a viable, though potentially expensive, route to high in-domain performance at both task levels, but is unlikely to be sufficient on its own for robust out-of-domain performance on high-level tasks, where additional approaches will likely be needed.
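Estimates like the 10M and 150M episode figures come from fitting scaling curves to accuracy as a function of training-set size and extrapolating to a target accuracy. The following is a minimal sketch of that general mechanic with scipy, using made-up measurements and a saturating power-law form; it does not reproduce the paper's actual fitting procedure or data:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (training episodes, accuracy) measurements; illustrative only,
# not the paper's numbers.
episodes = np.array([1e3, 3e3, 1e4, 3e4, 1e5])
accuracy = np.array([0.45, 0.52, 0.58, 0.63, 0.68])

def scaling_curve(n, a, b):
    """Saturating power law: error decays as a * n**-b, so accuracy tends to 1."""
    return 1.0 - a * n ** -b

(a, b), _ = curve_fit(scaling_curve, episodes, accuracy, p0=(1.0, 0.1))

# Invert the fitted curve to estimate the data needed for a target accuracy.
target = 0.95
needed = (a / (1.0 - target)) ** (1.0 / b)
print(f"fit: a={a:.3f}, b={b:.3f}; episodes needed for {target:.0%}: {needed:.2e}")
```

Because the exponent b is small when accuracy improves slowly with data, the extrapolated requirement grows enormously, which is the same qualitative effect behind the gap between the low-level and high-level estimates reported above.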