[slides and audio] Cradle: Empowering Foundation Agents Towards General Computer Control

CRADLE is a modular, flexible framework that lets foundation agents perform complex computer tasks through a unified interface: screenshots as input, keyboard and mouse operations as output. It addresses the challenge of generalizing across virtual scenarios by restricting agents to this one standardized interface for all software. CRADLE consists of six key modules: Information Gathering, Self-Reflection, Task Inference, Skill Curation, Action Planning, and Memory. Together these modules allow CRADLE to interpret input screenshots, generate executable code for low-level control, and interact with any software to complete long-horizon tasks without relying on built-in APIs. The framework has been tested on four commercial video games (Red Dead Redemption 2, Stardew Valley, Dealer's Life 2, and Cities: Skylines) and five software applications (Chrome, Outlook, CapCut, Meitu, and Feishu). Across these environments, CRADLE generalizes well, completing complex missions, managing tasks, and operating a range of applications. Notably, CRADLE is the first framework to enable foundation agents to follow the main storyline and complete 40-minute-long real missions in Red Dead Redemption 2, create a city of a thousand people in Cities: Skylines, farm and harvest parsnips in Stardew Valley, and achieve a weekly total profit of 87% in Dealer's Life 2. These results highlight its potential to extend the capabilities of foundation agents and to facilitate data collection for further self-improvement. Its limitations include difficulty with out-of-distribution tasks, no audio processing, and the high cost of interacting with large multimodal models (LMMs).
Future work will focus on enhancing LMM capabilities, enabling audio processing, and improving efficiency in task execution.
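
As a rough illustration of how the six modules named above might fit into one control loop, here is a minimal Python sketch. It is not CRADLE's actual code: the class `ToyAgent`, its method names, the string-based actions, and the order in which the modules are called are all assumptions made for this example, and the screenshot is replaced by a short text description so the snippet runs without an LMM.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A low-level action is a plain string such as "key:esc" or "click:100,200".
# A real agent would send these to the operating system instead.
Action = str


@dataclass
class ToyAgent:
    """Hypothetical agent with one stub per module named in the summary."""
    memory: List[dict] = field(default_factory=list)
    skills: Dict[str, Callable[[], List[Action]]] = field(default_factory=dict)

    def gather_information(self, screenshot: str) -> str:
        # Stand-in for an LMM describing the screenshot; here the
        # "screenshot" is already a text description.
        return screenshot.strip().lower()

    def self_reflect(self, observation: str) -> str:
        # Judge the previous action by comparing consecutive observations.
        if not self.memory:
            return "first step"
        if self.memory[-1]["observation"] == observation:
            return "no visible change"
        return "screen changed"

    def infer_task(self, observation: str, reflection: str) -> str:
        # Toy rule for the next sub-task; a real system would query the LMM.
        return "select_item" if "menu" in observation else "open_menu"

    def curate_skills(self, task: str) -> None:
        # Register a code-level skill for the task if none exists yet.
        if task not in self.skills:
            table = {"open_menu": ["key:esc"], "select_item": ["click:100,200"]}
            self.skills[task] = lambda t=task: table[t]

    def plan_action(self, task: str) -> List[Action]:
        # Execute the chosen skill to obtain keyboard and mouse actions.
        return self.skills[task]()

    def step(self, screenshot: str) -> List[Action]:
        observation = self.gather_information(screenshot)
        reflection = self.self_reflect(observation)
        task = self.infer_task(observation, reflection)
        self.curate_skills(task)
        actions = self.plan_action(task)
        # Memory: keep a record of this step for later reflection.
        self.memory.append({
            "observation": observation,
            "reflection": reflection,
            "task": task,
            "actions": actions,
        })
        return actions


if __name__ == "__main__":
    agent = ToyAgent()
    for shot in ["Game world", "Menu open"]:
        print(agent.step(shot))
```

The point of the sketch is the interface boundary: everything the agent sees arrives as a screenshot, and everything it does leaves as keyboard or mouse actions, so the same loop could in principle be pointed at a game or a desktop application without application-specific APIs.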
