SciAssess: Benchmarking LLM Proficiency in Scientific Literature Analysis


18 Jun 2024 | Hengxing Cai, Xiaochen Cai, Junhan Chang, Sihang Li, Lin Yao, Changxin Wang, Zhifeng Gao, Hongshuai Wang, Yongge Li, Mujie Lin, Shuwen Yang, Jiankun Wang, Mingjun Xu, Jin Huang, Fang Xi, Jiaxi Zhuang, Yuqi Yin, Yaqi Li, Changhong Chen, Zheng Cheng, Zifeng Zhao, Linfeng Zhang, Guolin Ke
The paper introduces SciAssess, a comprehensive benchmark designed to evaluate the proficiency of Large Language Models (LLMs) in scientific literature analysis. SciAssess aims to address the limitations of existing benchmarks by focusing on three levels of ability: Memorization (L1), Comprehension (L2), and Analysis & Reasoning (L3). The benchmark covers a wide range of scientific fields, including fundamental science, alloy materials, biomedicine, drug discovery, and organic materials, with 29 tasks in total. Each task is designed to assess a different aspect of LLM capability, such as extracting information from text, charts, molecular structures, and tables. The evaluation includes rigorous quality control measures to ensure accuracy, anonymization, and compliance with copyright standards. The performance of 11 LLMs, including GPT, Claude, and Gemini, is assessed, highlighting their strengths and areas for improvement. The insights gained from SciAssess are intended to support the development of LLMs for scientific literature analysis, ultimately contributing to scientific discovery and innovation.