The paper introduces SciCode, a research coding benchmark curated by scientists to evaluate the capabilities of language models (LMs) in generating code that solves real scientific research problems. The benchmark comprises 80 main problems, each decomposed into multiple subproblems, for a total of 338 subproblems. The problems span 16 natural science fields, including mathematics, physics, chemistry, biology, and materials science. Each problem provides scientific background and detailed instructions, along with gold-standard solutions and test cases for evaluation. The evaluation setup is deliberately challenging: the best-performing model, Claude 3.5 Sonnet, solves only 4.6% of the main problems in the most realistic setting. The paper also discusses SciCode's design principles, annotation process, and evaluation setups, highlighting its potential to motivate research on new AI methods for accelerating scientific research.
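
To make the decomposition-plus-test-case evaluation concrete, here is a minimal, hypothetical sketch of how such a harness could be organized. It is not the official SciCode implementation; the names (Subproblem, MainProblem, main_problem_solved) and the assumption that a main problem counts as solved only when every subproblem passes all of its tests are illustrative, chosen to reflect the setup described above.

```python
# Hypothetical sketch of a SciCode-style evaluation harness (illustrative only,
# not the paper's actual code). Assumption: a main problem is marked solved
# only if every subproblem's solution passes all of its test cases.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Subproblem:
    name: str
    solution: Callable[..., object]  # model-generated function under test
    test_cases: List[Callable[[Callable[..., object]], bool]] = field(default_factory=list)


@dataclass
class MainProblem:
    name: str
    subproblems: List[Subproblem]


def subproblem_passes(sub: Subproblem) -> bool:
    """A subproblem passes only if every test case accepts its solution."""
    return all(test(sub.solution) for test in sub.test_cases)


def main_problem_solved(problem: MainProblem) -> bool:
    """A main problem counts as solved only if all subproblems pass (assumed rule)."""
    return all(subproblem_passes(sub) for sub in problem.subproblems)


if __name__ == "__main__":
    # Toy example: one main problem with two trivial subproblems.
    square = Subproblem(
        name="square",
        solution=lambda x: x * x,
        test_cases=[lambda f: f(3) == 9, lambda f: f(-2) == 4],
    )
    cube = Subproblem(
        name="cube",
        solution=lambda x: x ** 3,
        test_cases=[lambda f: f(2) == 8],
    )
    demo = MainProblem(name="demo", subproblems=[square, cube])
    print(main_problem_solved(demo))  # True only if every test of every subproblem passes
```

Under this all-or-nothing rule, a single failing subproblem sinks the whole main problem, which is consistent with the low 4.6% main-problem solve rate reported for the best model.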