DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents

10 Jun 2024 | Peter Jansen*, Marc-Alexandre Côté*, Tushar Khot*, Erin Bransom*, Bhavana Dalvi Mishra*, Bodhisattwa Prasad Majumder*, Oyvind Tafjord*, Peter Clark*
DISCOVERYWORLD is a virtual environment for developing and evaluating agents' ability to perform end-to-end scientific discovery. It includes 120 tasks across eight topics, each with three difficulty levels and parametric variations, and each task requires agents to form hypotheses, design experiments, analyze results, and draw conclusions. The environment is text-based with optional 2D visuals, and agents can interact with objects, use scientific equipment, and make observations. It is inspired by existing text-based simulations but is novel in its tasks and structure. The tasks are long-horizon and come with no predefined solution approach, so agents must search and analyze systematically.
DISCOVERYWORLD provides three automatic evaluation metrics: task completion, task process (the task-relevant actions taken), and discovered explanatory knowledge. It also includes unit tests to distinguish between general discovery skills and task-specific knowledge. Baseline agents, including ReAct and Hypothesizer, performed poorly on most tasks, highlighting the challenges of end-to-end scientific discovery. Human scientists outperformed the agents, with an average completion rate of 66% and knowledge performance of 55%. DISCOVERYWORLD aims to accelerate the development of general AI discovery agents by providing a benchmark for scientific discovery capabilities. The environment is open-sourced and available for research and development.
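To make the three metrics concrete, here is a minimal sketch of how one agent run might be scored. It is an illustration under stated assumptions, not the DISCOVERYWORLD API: the `EpisodeLog` record, the `score_episode` function, the milestone names, and the way each score is computed (a binary completion flag, the fraction of task-relevant milestones reached, and the fraction of knowledge questions graded correct) are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class EpisodeLog:
    """Hypothetical record of one agent run (not part of DISCOVERYWORLD)."""
    task_completed: bool               # did the agent finish the task?
    milestones_all: set[str]           # task-relevant actions defined for the task
    milestones_hit: set[str]           # the subset the agent actually performed
    knowledge_answers: dict[str, bool]  # question id -> graded as correct?


def score_episode(log: EpisodeLog) -> dict[str, float]:
    """Return the three scores, each in the range [0, 1]."""
    completion = 1.0 if log.task_completed else 0.0

    # Process: share of the task-relevant milestones the agent reached.
    relevant_hits = log.milestones_hit & log.milestones_all
    process = (
        len(relevant_hits) / len(log.milestones_all)
        if log.milestones_all else 0.0
    )

    # Knowledge: share of explanatory-knowledge questions graded correct.
    knowledge = (
        sum(log.knowledge_answers.values()) / len(log.knowledge_answers)
        if log.knowledge_answers else 0.0
    )

    return {"completion": completion, "process": process, "knowledge": knowledge}


# Assumed example: the agent did most of the right things but never finished.
log = EpisodeLog(
    task_completed=False,
    milestones_all={"collect_samples", "use_instrument", "record_result", "submit_answer"},
    milestones_hit={"collect_samples", "use_instrument", "record_result"},
    knowledge_answers={"q1": True, "q2": False},
)
scores = score_episode(log)
print(scores)
```

Keeping the three scores separate reflects why the summary lists them separately: an agent can take many task-relevant actions without completing the task, or complete the task without discovering the explanation behind it.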