2 Jul 2024 | Weihao Tan, Wentao Zhang, Xinrun Xu, Haochong Xia, Ziluo Ding, Boyu Li, Bohan Zhou, Junpeng Yue, Jiechuan Jiang, Yewen Li, Ruyi An, Molei Qin, Chuqiao Zong, Longtao Zheng, Yujie Wu, Xiaoqiang Chai, Yifei Bi, Tianbao Xie, Pengjie Gu, Xiyun Li, Ceyao Zhang, Long Tian, Chaojie Wang, Xinrun Wang, Börje F. Karlsson, Bo An, Shuicheng Yan, Zongqing Lu
CRADLE is a modular and flexible framework powered by large multimodal models (LMMs) designed to enable foundation agents to perform complex computer tasks through a unified interface. The framework uses screenshots as input and keyboard/mouse actions as output, allowing agents to interact with any software without relying on built-in APIs. CRADLE consists of six key modules: information gathering, self-reflection, task inference, skill curation, action planning, and memory. These modules work together to understand input screenshots, generate executable code for low-level control, and complete long-horizon tasks. Experimental results show that CRADLE exhibits remarkable generalization across four commercial video games, five software applications, and a comprehensive benchmark, OSWorld. It is the first to enable foundation agents to follow the main storyline and complete 40-minute-long real missions in the complex AAA game Red Dead Redemption 2 (RDR2). CRADLE can also create a city of a thousand people in Cities: Skylines, farm and harvest parsnips in Stardew Valley, and trade and bargain with a maximal weekly total profit of 87% in Dealer's Life 2. It can operate daily software like Chrome, Outlook, and Feishu, and edit images and videos using Meitu and CapCut. CRADLE extends the reach of foundation agents by enabling the easy conversion of any software into benchmarks to evaluate agents' abilities. The framework is open-source and aims to accelerate the development of more powerful foundation agents, advancing the path towards Artificial General Intelligence (AGI).
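To make the screenshot-in, keyboard/mouse-out loop concrete, the following is a minimal sketch of how six such modules could be chained. It is not CRADLE's actual code: every name here (`Action`, `Memory`, `run_agent`, the module functions, the `capture` and `execute` callbacks) is hypothetical, each LMM call is replaced by a trivial stand-in, and running the modules in the order the abstract lists them is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Action:
    """A low-level keyboard or mouse action (hypothetical representation)."""
    device: str   # "keyboard" or "mouse"
    command: str  # e.g. "press:w" or "click:120,340"


@dataclass
class Memory:
    """Memory module: keeps a record of each step for later use."""
    entries: List[str] = field(default_factory=list)

    def store(self, entry: str) -> None:
        self.entries.append(entry)


Skill = Callable[[], Action]


def gather_information(screenshot: bytes) -> str:
    # Stand-in for an LMM call that describes the screenshot.
    return f"screenshot of {len(screenshot)} bytes"


def self_reflect(observation: str, last_action: Optional[Action]) -> str:
    # Stand-in for judging whether the previous action had the intended effect.
    if last_action is None:
        return "no previous action"
    return f"after {last_action.command}, saw {observation}"


def infer_task(goal: str, reflection: str) -> str:
    # Stand-in for choosing the current sub-task of a long-horizon goal.
    return f"advance '{goal}' given: {reflection}"


def curate_skills(skills: Dict[str, Skill]) -> Dict[str, Skill]:
    # Stand-in for generating or updating skills; one fixed skill is registered.
    skills.setdefault("move_forward", lambda: Action("keyboard", "press:w"))
    return skills


def plan_action(task: str, skills: Dict[str, Skill]) -> Action:
    # Stand-in for picking a skill for the task; takes the first skill by name.
    return skills[sorted(skills)[0]]()


def run_agent(
    goal: str,
    capture: Callable[[], bytes],
    execute: Callable[[Action], None],
    steps: int,
) -> Memory:
    """Run the assumed module order for a fixed number of steps."""
    memory = Memory()
    skills: Dict[str, Skill] = {}
    last_action: Optional[Action] = None
    for _ in range(steps):
        observation = gather_information(capture())
        reflection = self_reflect(observation, last_action)
        task = infer_task(goal, reflection)
        skills = curate_skills(skills)
        last_action = plan_action(task, skills)
        execute(last_action)  # the only point where the agent touches the software
        memory.store(f"{observation} | {reflection} | {task} | {last_action.command}")
    return memory


if __name__ == "__main__":
    performed: List[Action] = []
    # Fake screen capture and executor, so the sketch runs without any real software.
    log = run_agent("harvest parsnips", lambda: b"\x00" * 16, performed.append, steps=3)
    for entry in log.entries:
        print(entry)
```

Because the agent only sees what `capture` returns and only acts through `execute`, the same loop could in principle be pointed at any application by swapping those two callbacks, which is the property the abstract attributes to the unified interface.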