OSWORLD is a new benchmark and environment for evaluating multimodal agents in real computer environments. It supports open-ended tasks across operating systems such as Ubuntu, Windows, and macOS, and provides a scalable, interactive environment for task setup, execution-based evaluation, and interactive learning. The benchmark comprises 369 real-world computer tasks involving web and desktop applications, OS file I/O, and multi-app workflows, ranging from image viewing to software integration and programming. Each task is derived from a real-world use case and includes a detailed setup configuration and an evaluation script for reliable assessment. The benchmark uses 134 unique evaluation functions, significantly more than previous work, reflecting the diversity and complexity of its tasks, many of which require interacting with multiple applications and interfaces, as real-world computer use does. The environment supports parallel execution and headless operation, facilitating research on generalist computer agents, and the environment, benchmark, and data are publicly available for research. An evaluation of state-of-the-art LLM- and VLM-based agents reveals significant deficiencies in their ability to act as computer assistants: humans can complete over 72.36% of the tasks, while the best model achieves only 12.24% success, struggling with GUI grounding and operational knowledge. OSWORLD also provides a comprehensive analysis of task complexity and evaluation challenges, offering insights for developing multimodal generalist agents.
The results highlight the need for advanced models and techniques to handle the challenges of open-ended tasks in real computer environments.
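
To make the idea of execution-based evaluation concrete, the following is a minimal sketch, not OSWORLD's actual code or API. It assumes a toy task whose initial state is created by a setup function and whose success is judged by inspecting the final state of a working directory rather than the agent's action sequence. All names here (`Task`, `run_task`, `setup_notes`, `check_renamed`, and the two toy agents) are hypothetical and chosen only for illustration.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class Task:
    """Hypothetical task record: an instruction, a setup step, and a checker."""
    instruction: str
    setup: Callable[[Path], None]     # prepares the initial state
    evaluate: Callable[[Path], bool]  # inspects the final state


def setup_notes(workdir: Path) -> None:
    # Initial state: a single file that the agent is asked to rename.
    (workdir / "draft.txt").write_text("meeting notes")


def check_renamed(workdir: Path) -> bool:
    # Success depends only on the resulting files, not on how the agent got there.
    old, new = workdir / "draft.txt", workdir / "final.txt"
    return (not old.exists()) and new.exists() and new.read_text() == "meeting notes"


def run_task(task: Task, agent: Callable[[str, Path], None]) -> bool:
    # Each run gets a fresh directory, so runs cannot affect one another.
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        task.setup(workdir)
        agent(task.instruction, workdir)
        return task.evaluate(workdir)


def renaming_agent(instruction: str, workdir: Path) -> None:
    # Toy agent that performs the requested rename.
    (workdir / "draft.txt").rename(workdir / "final.txt")


def idle_agent(instruction: str, workdir: Path) -> None:
    # Toy agent that does nothing, so the task should be judged as failed.
    pass


TASK = Task("Rename draft.txt to final.txt", setup_notes, check_renamed)

if __name__ == "__main__":
    print(run_task(TASK, renaming_agent))
    print(run_task(TASK, idle_agent))
```

Checking the end state means that any sequence of actions reaching the correct result counts as a success, which is what allows open-ended tasks with several valid solution paths to be scored reliably.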