A Survey on Evaluation of Large Language Models

July 2023 | YUPENG CHANG* and XU WANG*, School of Artificial Intelligence, Jilin University, China; JINDONG WANG†, Microsoft Research Asia, China; YUAN WU†, School of Artificial Intelligence, Jilin University, China; LINYI YANG, Westlake University, China; KAIJIE ZHU, Institute of Automation, Chinese Academy of Sciences, China; HAO CHEN, Carnegie Mellon University, USA; XIAOYUAN YI, Microsoft Research Asia, China; CUNXIANG WANG, Westlake University, China; YIDONG WANG, Peking University, China; WEI YE, Peking University, China; YUE ZHANG, Westlake University, China; YI CHANG, School of Artificial Intelligence, Jilin University, China; PHILIP S. YU, University of Illinois at Chicago, USA; QIANG YANG, Hong Kong University of Science and Technology, China; XING XIE, Microsoft Research Asia, China
This paper provides a comprehensive review of evaluation methods for large language models (LLMs), organized around three key dimensions: what to evaluate, where to evaluate, and how to evaluate. The authors, from institutions in China and the USA, highlight the importance of LLMs in both academic and industrial settings owing to their strong performance across a wide range of applications. The survey covers evaluation tasks spanning natural language processing, reasoning, medical usage, ethics, education, the natural and social sciences, and agent applications. It also discusses how to select appropriate datasets and benchmarks, and reviews current evaluation protocols alongside novel approaches. The authors summarize success and failure cases of LLMs on different tasks, emphasizing the need for deeper research and optimization. The paper further addresses future challenges in LLM evaluation, such as ensuring safety and reliability in critical sectors and designing new evaluation protocols, with the aim of offering valuable insights to researchers and fostering a collaborative community for better evaluation.