SciCode: A Research Coding Benchmark Curated by Scientists


18 Jul 2024 | Minyang Tian, Luyu Gao, Shizhuo Dylan Zhang, Xinan Chen, Cunwei Fan, Xuefei Guo, Roland Haas, Pan Ji, Kittithat Krongchon, Yao Li, Shengyan Liu, Di Luo, Yutao Ma, Hao Tong, Kha Trinh, Chenyu Tian, Zihan Wang, Bohao Wu, Yanyu Xiong, Shengzhu Yin, Minhui Zhu, Kilian Lieret, Yanxin Lu, Genglin Liu, Yufeng Du, Tianhua Tao, Ofir Press, Jamie Callan, Eliu Huerta, Hao Peng
The paper introduces SciCode, a research coding benchmark curated by scientists to evaluate the ability of language models (LMs) to generate code that solves real scientific research problems. The benchmark contains 80 main problems, each decomposed into multiple subproblems, for a total of 338 subproblems. The problems span 16 diverse subfields across the natural sciences, including mathematics, physics, chemistry, biology, and materials science. Each problem provides scientific background and detailed instructions, along with gold-standard solutions and test cases for evaluation. The evaluation setup is deliberately challenging: the best-performing model, Claude 3.5 Sonnet, solves only 4.6% of the main problems in the most realistic setting. The paper also discusses SciCode's design principles, annotation process, and evaluation setups, highlighting its potential to motivate research on new AI methods for accelerating scientific research.
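
To illustrate the kind of evaluation the summary describes, where a model-generated solution to a subproblem is judged by whether it passes all provided test cases, here is a minimal, hypothetical sketch. The `run_subproblem` function, the toy normalization task, and the test strings are assumptions for illustration, not the paper's actual harness.

```python
# Minimal sketch (not the official SciCode harness) of scoring a
# model-generated subproblem solution against provided test cases.
# All names, the example task, and the test values are illustrative.

import numpy as np


def run_subproblem(generated_code: str, test_cases: list[str]) -> bool:
    """Execute model-generated code, then run each test case;
    the subproblem counts as solved only if every test passes."""
    namespace: dict = {"np": np}
    try:
        exec(generated_code, namespace)   # define the requested function
        for test in test_cases:
            exec(test, namespace)         # each test raises AssertionError on failure
    except Exception:
        return False
    return True


# Toy "subproblem": implement a vector-normalization helper.
candidate = """
def normalize(v):
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)
"""
tests = [
    "assert np.allclose(normalize([3.0, 4.0]), [0.6, 0.8])",
    "assert np.isclose(np.linalg.norm(normalize([1.0, 2.0, 2.0])), 1.0)",
]

print(run_subproblem(candidate, tests))  # True if the candidate passes all tests
```

In this all-or-nothing style of scoring, a main problem would count as solved only when every one of its subproblems passes its tests, which is consistent with the low solve rates the summary reports.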