Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate

30 Jan 2024 | Steffi Chern, Ethan Chern, Graham Neubig, Pengfei Liu
The paper addresses the challenge of evaluating Large Language Models (LLMs) as evaluators across diverse tasks and scenarios. Evaluation increasingly relies on LLMs themselves to judge model responses, but checking whether these LLM evaluators can be trusted is constrained by the limited coverage of existing benchmarks and the extensive human annotation that traditional meta-evaluation requires. To overcome these challenges, the authors propose SCALE-EVAL, a scalable meta-evaluation framework that leverages multiple LLM agents in a multi-agent debate to assist human annotators in determining the most capable LLMs as evaluators. The framework supports multi-round discussions, substantially reducing the annotation workload typically required for large-scale meta-evaluation. The authors also release the code for their framework, making it publicly available for further research and development.

The paper presents several experiments to validate the effectiveness of SCALE-EVAL: a meta-meta evaluation that compares SCALE-EVAL's results with human expert annotations, and an assessment of the reliability and cost-performance trade-offs of different LLMs as evaluators under various scenarios. The authors additionally examine how variations in criteria prompts affect the performance of LLMs as evaluators. The results show that SCALE-EVAL achieves high agreement with human expert annotations, indicating its reliability and effectiveness for meta-evaluation, and the framework reveals the capabilities and limitations of different LLMs as evaluators under various criteria and scenarios.

Overall, the paper contributes a scalable and reliable solution for evaluating LLMs as evaluators, which is crucial for ensuring the quality and reliability of LLM-generated outputs in diverse applications.
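To make the debate-based protocol concrete, below is a minimal sketch of a multi-round agent-debate loop that escalates to a human annotator only when the agents cannot reach consensus. This is an illustrative reconstruction, not the paper's actual implementation: the function `query_llm`, the agent list, the round limit, and the verdict-parsing convention are all assumptions introduced here for clarity.

```python
# Minimal sketch (assumed, not the authors' code) of a multi-round
# agent-debate loop for meta-evaluating a pairwise judgment.

def query_llm(agent: str, prompt: str) -> str:
    """Placeholder for a call to the given LLM agent (e.g. a chat-completion API)."""
    raise NotImplementedError  # wire up your own LLM client here

def debate_meta_evaluation(instruction: str, response_a: str, response_b: str,
                           criteria: str, agents: list[str],
                           max_rounds: int = 3) -> str:
    """Return 'A', 'B', or 'human' (escalate to a human annotator)."""
    transcript = ""
    for round_idx in range(max_rounds):
        verdicts = {}
        for agent in agents:
            prompt = (
                f"Evaluation criteria: {criteria}\n"
                f"Instruction: {instruction}\n"
                f"Response A: {response_a}\n"
                f"Response B: {response_b}\n"
                f"Discussion so far:\n{transcript}\n"
                "Explain your reasoning, then end with 'Verdict: A' or 'Verdict: B'."
            )
            reply = query_llm(agent, prompt)
            transcript += f"\n[Round {round_idx + 1}] {agent}: {reply}"
            # Assumed convention: the agent's reply ends with its verdict letter.
            verdicts[agent] = "A" if reply.strip().endswith("A") else "B"
        # If every agent agrees, that consensus becomes the meta-evaluation label.
        if len(set(verdicts.values())) == 1:
            return next(iter(verdicts.values()))
    # No consensus within the allotted rounds: defer to a human annotator.
    return "human"
```

The key design point this sketch illustrates is why the protocol is scalable: human annotators are consulted only for the minority of cases where the debating agents disagree after the final round, rather than for every example.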