1 May 2024 | Pat Verga, Sebastian Hofstätter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, Patrick Lewis
The paper "Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models" by Pat Verga et al. addresses the challenge of evaluating Large Language Models (LLMs) by proposing a novel approach called the Panel of LLM Evaluators (PoLL). The authors argue that using a single large model like GPT-4 as an evaluator introduces intra-model bias and is costly. Instead, they suggest using a panel of smaller models from different families to score LLM generations, which reduces bias and is more cost-effective.
The study covers three distinct judge settings and six datasets spanning single-hop QA, multi-hop QA, and Chatbot Arena. The results show that PoLL correlates with human judgments better than a single large judge, exhibits less intra-model bias, and is over seven times cheaper. The authors also find that GPT-4, while powerful, is a relatively weak judge in certain scenarios, and that prompt variations can significantly change its performance.
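Judge quality is measured as agreement with human labels. A minimal sketch of that comparison, assuming binary correct/incorrect verdicts and using Cohen's kappa (via scikit-learn) as one possible agreement statistic; the label arrays below are made up purely for illustration:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical verdicts over the same set of answers: 1 = correct, 0 = incorrect.
human_labels = [1, 0, 1, 1, 0, 1, 0, 1]
single_judge = [1, 1, 1, 1, 0, 0, 0, 1]   # verdicts from one large judge model
panel_labels = [1, 0, 1, 1, 0, 1, 0, 0]   # pooled verdicts from the panel

# Higher kappa means closer agreement with the human annotators.
print("single judge vs. human:", cohen_kappa_score(human_labels, single_judge))
print("panel vs. human:       ", cohen_kappa_score(human_labels, panel_labels))
```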
The paper concludes that PoLL is a robust and effective method for evaluating LLM performance, reducing bias and costs while maintaining high correlation with human judgments. However, further research is needed to explore the broader applicability of PoLL and to optimize the selection of models for the panel.