3 Jun 2024 | Zhumin Chu, Qingyao Ai, Yiteng Tu, Haitao Li, Yiqun Liu
The paper introduces a novel framework called Peer Review Evaluator (PRE) for automatically evaluating the performance of large language models (LLMs). The framework is inspired by the peer review system used in academic publishing and aims to address the limitations of existing evaluation paradigms, which often suffer from high costs, low generalizability, and inherent biases. PRE runs a peer-review process in which a small set of powerful LLMs is selected as "reviewers" to rate the outputs, called "submissions," produced by the other LLMs; the final ranking of the LLMs is then generated from the ratings provided by all reviewers, as sketched below.
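To make the rate-and-aggregate idea concrete, here is a minimal Python sketch of a peer-review-style ranking loop. The function name `peer_review_rank`, the stub reviewers, and the unweighted mean aggregation are illustrative assumptions rather than the paper's actual implementation; in particular, this sketch omits the reviewer qualification stage that the paper also evaluates.

```python
from statistics import mean
from typing import Callable, Dict, List

def peer_review_rank(
    submissions: Dict[str, List[str]],        # candidate LLM -> its outputs ("submissions")
    reviewers: List[Callable[[str], float]],  # each reviewer maps an output to a rating
) -> List[str]:
    """Rank candidate LLMs by the mean rating assigned by all reviewers."""
    scores: Dict[str, float] = {}
    for model_name, outputs in submissions.items():
        # Every reviewer rates every output produced by this candidate model.
        ratings = [reviewer(output) for output in outputs for reviewer in reviewers]
        scores[model_name] = mean(ratings)
    # Higher mean rating ranks first.
    return sorted(scores, key=scores.get, reverse=True)

# Example with stub reviewers (placeholders for actual LLM-based raters).
stub_reviewers = [
    lambda text: min(10.0, len(text.split()) / 5),  # crude length-based score
    lambda text: 5.0,                               # constant-score reviewer
]
ranking = peer_review_rank(
    {
        "model_a": ["A short summary."],
        "model_b": ["A longer, more detailed summary of the document."],
    },
    stub_reviewers,
)
print(ranking)  # -> ['model_b', 'model_a']
```

In practice, each reviewer would be a call to a reviewer LLM that returns a score, and the aggregation could weight reviewers (for example, by their performance on a qualification test) rather than averaging them uniformly.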
The authors conducted extensive experiments on two tasks: text summarization and non-factoid question answering, using eleven LLMs, including GPT-4. The results demonstrate that PRE outperforms all baselines, including GPT-4, in terms of both cost efficiency and robustness. The study also reveals that using a single LLM for evaluation can lead to significant bias, which is mitigated by the PRE framework. The framework's effectiveness is further validated through experiments that vary hyperparameters and qualification methods, showing that PRE remains stable and reliable.
In conclusion, PRE provides a novel and automatic method for evaluating LLMs, addressing the challenges of existing evaluation paradigms and demonstrating its potential for broader application in various evaluation tasks and scenarios.