SciCode: A Research Coding Benchmark Curated by Scientists


18 Jul 2024 | Minyang Tian, Luyu Gao, Shizhuo Dylan Zhang, Xinan Chen, Cunwei Fan, Xuefei Guo, Roland Haas, Pan Ji, Kittithat Krongchon, Yao Li, Shengyan Liu, Di Luo, Yutao Ma, Hao Tong, Kha Trinh, Chenyu Tian, Zihan Wang, Bohao Wu, Yanyu Xiong, Shengzhu Yin, Minhui Zhu, Kilian Lieret, Yanxin Lu, Genglin Liu, Yufeng Du, Tianhua Tao, Ofir Press, Jamie Callan, Eliu Huerta, Hao Peng
The paper introduces SciCode, a research coding benchmark curated by scientists to evaluate the ability of language models (LMs) to generate code that solves real scientific research problems. The benchmark contains 80 main problems, each decomposed into multiple subproblems, for a total of 338 subproblems. The problems span 16 diverse subfields across the natural sciences, including mathematics, physics, chemistry, biology, and materials science. Each problem provides scientific background and detailed instructions, along with gold-standard solutions and test cases for evaluation. The evaluation setup is deliberately challenging: the best-performing model, Claude 3.5 Sonnet, solves only 4.6% of the main problems in the most realistic setting. The paper also discusses SciCode's design principles, annotation process, and evaluation setups, highlighting its potential to motivate research on new AI methods for accelerating scientific research.
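
To illustrate the kind of evaluation the summary describes, where a model-generated solution to a subproblem is judged by whether it passes all provided test cases, here is a minimal, hypothetical sketch. The `run_subproblem` function, the toy normalization task, and the test strings are assumptions for illustration, not the paper's actual harness.

```python
# Minimal sketch (not the official SciCode harness) of scoring a
# model-generated subproblem solution against provided test cases.
# All names, the example task, and the test values are illustrative.

import numpy as np


def run_subproblem(generated_code: str, test_cases: list[str]) -> bool:
    """Execute model-generated code, then run each test case;
    the subproblem counts as solved only if every test passes."""
    namespace: dict = {"np": np}
    try:
        exec(generated_code, namespace)   # define the requested function
        for test in test_cases:
            exec(test, namespace)         # each test raises AssertionError on failure
    except Exception:
        return False
    return True


# Toy "subproblem": implement a vector-normalization helper.
candidate = """
def normalize(v):
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)
"""
tests = [
    "assert np.allclose(normalize([3.0, 4.0]), [0.6, 0.8])",
    "assert np.isclose(np.linalg.norm(normalize([1.0, 2.0, 2.0])), 1.0)",
]

print(run_subproblem(candidate, tests))  # True if the candidate passes all tests
```

In this all-or-nothing style of scoring, a main problem would count as solved only when every one of its subproblems passes its tests, which is consistent with the low solve rates the summary reports.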