July 2023 | YUPENG CHANG, XU WANG, JINDONG WANG, YUAN WU, LINYI YANG, KAIJIE ZHU, HAO CHEN, XIAOYUAN YI, CUNXIANG WANG, YIDONG WANG, WEI YE, YUE ZHANG, YI CHANG, PHILIP S. YU, QIANG YANG, XING XIE
This paper presents a survey on the evaluation of large language models (LLMs), organized along three key dimensions: what to evaluate, where to evaluate, and how to evaluate. LLMs have attracted significant attention for their performance across a wide range of applications, and evaluation is crucial for understanding their strengths and weaknesses, ensuring their safety and reliability, and addressing potential risks. The survey reviews existing evaluation work on tasks spanning natural language processing, reasoning, ethics and bias, the social sciences, and medical applications, and it examines the contexts in which LLMs are evaluated, from datasets and benchmarks to real-world scenarios. It also highlights open challenges, including the limitations of current methods and the need for new evaluation protocols, with particular attention to safety, reliability, and ethical implications. The authors argue that while LLMs show significant potential in domains such as the social sciences, legal tasks, and psychology, further research is needed to improve their performance and address their limitations. They suggest future research directions for improving LLM evaluation, emphasize the value of open-source evaluation materials and collaborative effort in this area, and conclude that sound evaluation is essential for the development of more proficient LLMs.
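To make the "how to evaluate" dimension concrete, below is a minimal sketch of automatic benchmark evaluation, the kind of protocol such surveys catalog: scoring a model's answers against a labeled multiple-choice dataset by exact-match accuracy. This is an illustrative sketch, not the survey's own tooling; the `model_answer` callable, the toy dataset, and its field names are hypothetical placeholders.

```python
# Minimal sketch of automatic benchmark evaluation: exact-match accuracy
# on a multiple-choice dataset. All names here (model_answer, the dataset
# and its fields) are hypothetical placeholders, not from the survey.

from typing import Callable

# A toy benchmark: each item has a question, candidate choices, and a gold label.
BENCHMARK = [
    {"question": "2 + 2 = ?", "choices": ["3", "4", "5"], "answer": "4"},
    {"question": "Capital of France?", "choices": ["Paris", "Rome", "Berlin"], "answer": "Paris"},
]

def evaluate(model_answer: Callable[[str, list[str]], str]) -> float:
    """Score a model by exact-match accuracy over the benchmark.

    `model_answer` is any callable mapping (question, choices) to the
    model's chosen option string, e.g. a wrapper around an LLM API call.
    """
    correct = 0
    for item in BENCHMARK:
        prediction = model_answer(item["question"], item["choices"])
        if prediction.strip() == item["answer"]:
            correct += 1
    return correct / len(BENCHMARK)

if __name__ == "__main__":
    # Stand-in "model" that always picks the first choice; swap in a real
    # LLM call to evaluate an actual system.
    baseline = lambda question, choices: choices[0]
    print(f"Exact-match accuracy: {evaluate(baseline):.2f}")
```

Real benchmarks differ mainly in scale and in the scoring rule (exact match, log-likelihood over choices, or human judgment), but this loop of comparing model outputs against a labeled set is the core of most automatic evaluation protocols the survey reviews.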