This paper presents a two-stage framework for evaluating Large Language Models (LLMs) to determine whether they are sufficiently useful tools. The first stage assesses the core abilities of LLMs, covering reasoning, societal impact, and domain knowledge. The reasoning evaluation spans logical, mathematical, commonsense, multi-hop, and structured-data reasoning tasks; the societal-impact evaluation addresses safety, security, and ethical considerations; and the domain-knowledge evaluation targets areas such as finance, legislation, psychology, medicine, and education. The second stage evaluates LLMs as agents, focusing on planning, application scenarios, and benchmarking, with scenarios including web grounding, code generation, database queries, API calls, tool creation, and robotic navigation. Drawing on a systematic review of existing evaluation methods, the paper highlights the importance of refined evaluation methods for ensuring that LLMs are reliable, effective, and safe, and it identifies challenges and future directions for LLM evaluation, emphasizing the need for comprehensive, domain-specific benchmarks that assess LLM capabilities across diverse tasks and domains.
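As a rough illustration of how the two-stage taxonomy might be organized programmatically, the sketch below encodes the core-ability and agent-evaluation dimensions as plain Python data structures. The names and layout (e.g., `CORE_ABILITY`, `AGENT_EVALUATION`, `evaluation_axes`) are illustrative assumptions of this summary, not artifacts described in the paper.

```python
# Illustrative sketch only: one possible encoding of the survey's two-stage
# evaluation taxonomy. Dimension and sub-area names follow the summary above;
# the data-structure layout itself is an assumption made for illustration.

# Stage 1: core-ability evaluation dimensions and their sub-areas.
CORE_ABILITY = {
    "reasoning": [
        "logical", "mathematical", "commonsense", "multi-hop", "structured-data",
    ],
    "societal_impact": ["safety", "security", "ethics"],
    "domain_knowledge": [
        "finance", "legislation", "psychology", "medicine", "education",
    ],
}

# Stage 2: agent-evaluation dimensions, with representative application scenarios.
AGENT_EVALUATION = {
    "planning": [],
    "application_scenarios": [
        "web grounding", "code generation", "database queries",
        "API calls", "tool creation", "robotic navigation",
    ],
    "benchmarking": [],
}


def evaluation_axes(stage: dict) -> list[str]:
    """Flatten one stage of the taxonomy into 'dimension/sub-area' labels."""
    return [
        f"{dimension}/{area}" if area else dimension
        for dimension, areas in stage.items()
        for area in (areas or [None])
    ]


if __name__ == "__main__":
    print("Stage 1 axes:", evaluation_axes(CORE_ABILITY))
    print("Stage 2 axes:", evaluation_axes(AGENT_EVALUATION))
```

Such a flat enumeration of evaluation axes could, under these assumptions, serve as an index for mapping individual benchmarks onto the framework's dimensions.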