OSWORLD: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

30 May 2024 | Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, Tao Yu
OSWORLD is a novel, scalable, and real computer environment designed for benchmarking multimodal agents on open-ended tasks. It supports task setup, execution-based evaluation, and interactive learning across operating systems (Ubuntu, Windows, macOS). The environment includes a benchmark of 369 real-world computer tasks involving web and desktop apps, OS file I/O, and workflows spanning multiple applications. Each task is specified with an initial state setup configuration and a custom execution-based evaluation script for reliable and reproducible assessment. Extensive evaluation of state-of-the-art LLM/VLM-based agents on OSWORLD reveals significant deficiencies in their ability to serve as computer assistants, with the best model achieving only 12.24% success, struggling primarily with GUI grounding and operational knowledge. The analysis provides valuable insights for developing more generalist multimodal agents. The code, environment, baseline models, and data are publicly available at <https://os-world.github.io>.
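To make the idea of an execution-based task definition concrete, here is a minimal, hypothetical sketch in Python. The field names ("instruction", "setup", "evaluator"), the setup command, and the checker logic are illustrative assumptions, not the actual OSWORLD task schema; they only show how an initial state setup plus a state-inspecting evaluator can make assessment reproducible.

```python
# Hypothetical sketch of an execution-based task definition.
# Field names and evaluator logic are illustrative assumptions,
# not the actual OSWORLD schema.

import os

example_task = {
    "instruction": "Rename report.txt to report_final.txt on the Desktop.",
    # Initial state setup: commands run in the environment before the agent starts.
    "setup": [
        {"type": "execute", "command": "touch ~/Desktop/report.txt"},
    ],
    # Execution-based evaluation: inspect the final environment state rather
    # than the agent's action trace, so any successful strategy passes.
    "evaluator": {
        "type": "file_exists",
        "path": "~/Desktop/report_final.txt",
    },
}

def evaluate(task: dict) -> bool:
    """Return True if the final environment state satisfies the task goal."""
    evaluator = task["evaluator"]
    if evaluator["type"] == "file_exists":
        return os.path.exists(os.path.expanduser(evaluator["path"]))
    raise ValueError(f"Unknown evaluator type: {evaluator['type']}")
```

Checking the resulting state rather than the agent's actions is what allows many different interaction strategies (GUI clicks, keyboard shortcuts, or terminal commands) to be scored consistently.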