3 Jun 2024 | Zhumin Chu, Qingyao Ai, Yiteng Tu, Haitao Li, Yiqun Liu
This paper proposes Peer Review Evaluator (PRE), a novel framework for automatically evaluating the performance of large language models (LLMs). Inspired by the peer review mechanism in academic publishing, PRE uses multiple LLMs as reviewers to evaluate the outputs of other LLMs. The framework consists of three modules: a qualification exam module that selects LLMs with strong evaluation capabilities as reviewers by testing their ability to assess other models, a peer review module that collects the outputs of evaluatee LLMs and has the qualified reviewers rate them, and a "chair" decision module that aggregates the ratings from all reviewers into the final evaluation results. The framework is tested on two representative text generation tasks: text summarization and non-factoid question answering. The results show that PRE outperforms all baseline methods, including GPT-4, and demonstrates high consistency with human preferences. The framework is also shown to be robust and generalizable across different tasks and LLMs. The results further indicate that relying on a single LLM as an evaluator can introduce bias, whereas the peer review mechanism reduces it. The framework is designed to be cost-effective and scalable, making it a promising approach for evaluating LLMs.
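To make the three-stage pipeline concrete, here is a minimal sketch in Python. It assumes each reviewer LLM is wrapped in a callable that returns a numeric score for a candidate output, that the qualification exam is a small set of human-labeled grading items with an agreement threshold, and that the chair aggregates scores by a plain (optionally exam-accuracy-weighted) average. The paper's actual prompts, scoring scales, and aggregation strategy may differ; all names below (`Reviewer`, `qualification_exam`, etc.) are illustrative rather than taken from the PRE codebase.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A scorer is any callable that rates a (task_input, candidate_output) pair,
# e.g. a wrapper around an LLM API that prompts the model to grade the answer.
Scorer = Callable[[str, str], float]

@dataclass
class Reviewer:
    name: str
    score: Scorer               # scoring function backed by a reviewer LLM
    exam_accuracy: float = 0.0  # filled in by the qualification exam


def qualification_exam(reviewers: List[Reviewer],
                       exam_items: List[Dict],
                       threshold: float = 0.6) -> List[Reviewer]:
    """Module 1: keep only reviewers whose grades agree with human labels
    on the exam set often enough (agreement rate >= threshold)."""
    qualified = []
    for r in reviewers:
        correct = 0
        for item in exam_items:
            predicted = r.score(item["input"], item["output"])
            # Treat the exam as binary: does the reviewer's judgment
            # match the human label ("good" vs. "bad" output)?
            if (predicted >= 0.5) == item["human_label"]:
                correct += 1
        r.exam_accuracy = correct / len(exam_items)
        if r.exam_accuracy >= threshold:
            qualified.append(r)
    return qualified


def peer_review(reviewers: List[Reviewer],
                task_input: str,
                candidate_outputs: Dict[str, str]) -> Dict[str, List[float]]:
    """Module 2: every qualified reviewer rates every evaluatee's output."""
    return {
        evaluatee: [r.score(task_input, output) for r in reviewers]
        for evaluatee, output in candidate_outputs.items()
    }


def chair_decision(ratings: Dict[str, List[float]],
                   reviewers: List[Reviewer],
                   weighted: bool = True) -> Dict[str, float]:
    """Module 3: aggregate reviewer ratings into one final score per evaluatee,
    here as an average optionally weighted by each reviewer's exam accuracy."""
    weights = [r.exam_accuracy if weighted else 1.0 for r in reviewers]
    total = sum(weights)
    return {
        evaluatee: sum(w * s for w, s in zip(weights, scores)) / total
        for evaluatee, scores in ratings.items()
    }
```

A full run under these assumptions would chain the three modules: `qualified = qualification_exam(all_reviewers, exam_items)`, then `ratings = peer_review(qualified, question, outputs)`, and finally `final = chair_decision(ratings, qualified)`; ranking evaluatees by their final score yields the evaluation result.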