DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents

**Authors:** Peter Jansen, Marc-Alexandre Côté, Tushar Khot, Erin Bransom, Bhavana Dalvi Mishra, Bodhisattwa Prasad Majumder, Oyvind Tafjord, Peter Clark

**Institutions:** Allen Institute for Artificial Intelligence, Microsoft Research, University of Arizona

**Abstract:** Automated scientific discovery has the potential to accelerate progress across scientific domains. However, developing and evaluating AI agents' end-to-end scientific reasoning capabilities is challenging because running real-world experiments is costly or infeasible. This paper introduces DISCOVERYWORLD, a virtual environment designed to address this challenge. DISCOVERYWORLD is an inexpensive, text-based, simulated environment with optional 2D visual overlays. It includes 120 challenge tasks across eight diverse topics, each with three levels of difficulty and several parametric variations. Each task requires an agent to form hypotheses, design and run experiments, analyze results, and act on conclusions. DISCOVERYWORLD provides three automatic metrics for evaluating performance: task completion, task-relevant actions taken, and discovered explanatory knowledge. The authors find that strong baseline agents struggle on most DISCOVERYWORLD tasks, suggesting that the environment captures novel challenges in scientific discovery. The code for DISCOVERYWORLD is available on GitHub.

**Introduction:** DISCOVERYWORLD is intended to support the development of systems that can perform the full end-to-end research process: ideation, hypothesis formation, experiment design, data collection, analysis, and drawing conclusions. The environment is inspired by existing text-based simulation environments but is novel in its scope and tasks. It aims to cover a broad range of discovery topics and to encourage the development of general discovery skills rather than task-specific solutions.

**Contributions:**

- Introduction of DISCOVERYWORLD, the first virtual environment for benchmarking agents' general ability to perform complete cycles of novel scientific discovery.
- A comprehensive evaluation set of 120 tasks spanning eight diverse topics, each with three levels of difficulty and several parametric variations.
- An evaluation framework for automatically assessing agents' performance in DISCOVERYWORLD.
- Baseline results showing that contemporary agent models struggle with the novel challenges in DISCOVERYWORLD.

**Discussion:** The authors compare the performance of human scientists and baseline agents on DISCOVERYWORLD tasks, highlighting the gap between current agent models and human performance. They conclude that while expert human scientists find the challenge tasks difficult, strong agent baselines struggle to complete most tasks or to discover critical explanatory knowledge.

**Conclusion:** DISCOVERYWORLD is a valuable tool for developing and benchmarking agents' ability to perform end-to-end scientific discovery. The authors hope it will inspire and accelerate the development of new, general AI discovery agents.
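The task count in the abstract can be checked with a little arithmetic. The sketch below is only an illustration, not the DISCOVERYWORLD API: it assumes the 120 tasks are split evenly across the eight topics and three difficulty levels, which would imply five parametric variations per topic and difficulty pair. The topic names (`topic_1` and so on), the `TaskSpec` and `EpisodeScore` classes, and the field names for the three metrics are all hypothetical placeholders introduced here.

```python
from dataclasses import dataclass
from itertools import product

# Figures taken from the abstract.
NUM_TASKS = 120
NUM_TOPICS = 8
NUM_DIFFICULTIES = 3

# Assumption: tasks are divided evenly across (topic, difficulty) pairs.
VARIATIONS_PER_PAIR, remainder = divmod(NUM_TASKS, NUM_TOPICS * NUM_DIFFICULTIES)
assert remainder == 0, "an even split was assumed"


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical identifier for one challenge task."""
    topic: str
    difficulty: int
    variation: int


@dataclass
class EpisodeScore:
    """Hypothetical record of the three automatic metrics named in the abstract."""
    task_completed: bool            # task completion
    relevant_actions_taken: int     # task-relevant actions taken
    explanation_discovered: bool    # discovered explanatory knowledge


def build_task_grid() -> list[TaskSpec]:
    """Enumerate every (topic, difficulty, variation) combination."""
    topics = [f"topic_{i}" for i in range(1, NUM_TOPICS + 1)]  # placeholder names
    difficulties = range(1, NUM_DIFFICULTIES + 1)
    variations = range(VARIATIONS_PER_PAIR)
    return [TaskSpec(t, d, v) for t, d, v in product(topics, difficulties, variations)]


tasks = build_task_grid()
print(len(tasks), VARIATIONS_PER_PAIR)
```

Keeping the three metrics as separate fields, rather than one combined score, mirrors the abstract's distinction between finishing a task, taking relevant steps, and actually discovering the explanation.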


10 Jun 2024 | Peter Jansen*,†, Marc-Alexandre Côté†, Tushar Khot*, Erin Bransom*, Bhavana Dalvi Mishra*, Bodhisattwa Prasad Majumder*, Oyvind Tafjord*, Peter Clark*
