MARCH 19, 2024 | Jianfeng Zhan, Lei Wang, Wanling Gao, Hongxiao Li, Chenxi Wang, Yunyou Huang, Yatao Li, Zhengxin Yang, Guoxin Kang, Chunjie Luo, Hainan Ye, Shaopeng Dai, Zhifei Zhang
The article introduces the discipline of *evaluatology*, which encompasses the science and engineering of evaluation. It aims to address the lack of consensus on universal concepts, terminologies, theories, and methodologies in various fields. The authors propose a universal framework for evaluation, including five axioms that form the foundation of evaluation theory. These axioms focus on key aspects of evaluation outcomes, such as true quantity, traceability of discrepancies, comparability, and realistic estimates.
The core essence of evaluation is described as conducting deliberate experiments where a well-defined Evaluation Condition (EC) is applied to diverse subjects, resulting in the establishment of equivalent Evaluation Models (EMs). The article outlines a rigorous methodology for evaluating a single subject, emphasizing the creation of equivalent ECs (EECs) and the use of Reference Evaluation Models (REMs) to eliminate confounding variables.
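To make the EC/EM framing concrete, here is a minimal sketch in Python, using illustrative names that do not come from the paper: fix a well-defined evaluation condition and apply it unchanged to different subjects, so that the resulting outcomes are directly comparable.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Illustrative names only; the paper defines ECs and EMs conceptually, not as code.
@dataclass(frozen=True)
class EvaluationCondition:
    """A fixed, well-defined EC: the same problem instances and the same metric."""
    problem_instances: Sequence[tuple]               # (input, expected_output) pairs
    metric: Callable[[Sequence, Sequence], float]    # e.g. accuracy

def evaluate(subject: Callable, ec: EvaluationCondition) -> float:
    """Applying one EC to a subject plays the role of an evaluation model (EM);
    holding the EC fixed across subjects keeps their outcomes comparable."""
    inputs, expected = zip(*ec.problem_instances)
    predictions = [subject(x) for x in inputs]
    return ec.metric(predictions, expected)

def accuracy(predictions, expected):
    return sum(p == e for p, e in zip(predictions, expected)) / len(expected)

# Two subjects evaluated under the *same* EC, so their scores can be compared directly.
ec = EvaluationCondition(problem_instances=[(0, 0), (1, 1), (2, 0), (3, 1)], metric=accuracy)
subject_a = lambda x: x % 2   # an "even/odd" rule
subject_b = lambda x: 1       # always predicts 1
print(evaluate(subject_a, ec), evaluate(subject_b, ec))   # 1.0 0.5
```

Because the EC is identical for both subjects, any difference between the two scores is attributable to the subjects themselves rather than to the evaluation setup.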
The authors also discuss the establishment of EECs, the hierarchical definition of an EC, and universal concepts that cut across disciplines. They provide case studies illustrating how these concepts apply to the evaluation of AI algorithms, drugs, and policies. The article concludes by proposing a benchmark-based engineering approach to evaluation, referred to as "benchmarkology," which aims to simplify and standardize evaluation processes across various disciplines.
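As a rough illustration of that benchmark-based view, the sketch below (again with hypothetical names, not taken from the paper) packages a frozen workload, a metric, and a reference subject playing an REM-like role into a single reusable object, so independent evaluations of different subjects remain comparable and traceable.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical packaging of a benchmark as a frozen, reusable EEC plus a reference subject.
@dataclass(frozen=True)
class Benchmark:
    instances: Sequence[tuple]                      # fixed workload: (input, expected) pairs
    metric: Callable[[Sequence, Sequence], float]   # fixed measurement rule
    reference: Callable                             # reference subject in an REM-like role

    def score(self, subject: Callable) -> float:
        """Report the subject relative to the reference under identical conditions,
        so a discrepancy is traceable to the subject rather than to the setup."""
        def run(s: Callable) -> float:
            inputs, expected = zip(*self.instances)
            return self.metric([s(x) for x in inputs], expected)
        return run(subject) - run(self.reference)

# Example: any team re-running this frozen benchmark obtains comparable numbers.
acc = lambda pred, exp: sum(p == e for p, e in zip(pred, exp)) / len(exp)
bench = Benchmark(instances=[(0, 0), (1, 1), (2, 0), (3, 1)], metric=acc, reference=lambda x: 0)
print(bench.score(lambda x: x % 2))   # 1.0 - 0.5 = 0.5
```

Freezing the workload, metric, and reference in one artifact is what lets the same benchmark be reused across teams without re-deriving the evaluation conditions each time.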