Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models


1 May 2024 | Pat Verga, Sebastian Hofstätter, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, Patrick Lewis
This paper evaluates a Panel of LLM evaluators (PoLL) for assessing the quality of large language model (LLM) outputs, in comparison with a single large judge model such as GPT-4. As LLMs become more capable, evaluating them accurately is difficult: suitable test data is hard to source, and free-form generations are hard to score automatically. The common workaround, using a single large model as a judge, introduces intra-model bias and is expensive to run at scale. The authors instead propose PoLL, a panel of smaller models drawn from different model families whose individual judgments are aggregated into a single verdict, reducing bias and cost while improving agreement with human raters.

The study evaluates PoLL across three settings, single-hop QA, multi-hop QA, and Chatbot Arena, spanning six datasets. PoLL correlates with human judgments more strongly than a single large judge; as a reference point, the GPT-3.5 judge reaches a Cohen's κ of 0.726, and PoLL scores higher. PoLL also shows lower variance and more consistent behavior across tasks, and it costs less than one-seventh as much as judging with a single large model. The paper further highlights that GPT-4, despite its strength as a judge, can produce highly variable judgments under small changes to its prompt.

The findings suggest that PoLL is a more reliable and cost-efficient way to evaluate LLMs, particularly where a single large judge would introduce bias or be too costly. The authors note that further work is needed to test the approach in other domains, such as math and reasoning tasks.
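To make the aggregation step concrete, the sketch below shows PoLL-style scoring in Python. It assumes the two aggregation functions described in the paper, majority (max) voting over binary correct/incorrect QA judgments and average pooling over scalar preference scores for Chatbot Arena-style comparisons; the judge callables and the toy heuristics in the usage example are hypothetical placeholders for real LLM evaluator calls, not the authors' implementation.

```python
# Minimal sketch of PoLL-style score aggregation (not the authors' code).
# Each "judge" stands in for a prompted LLM evaluator; the paper's panel
# draws its judges from several different model families.
from collections import Counter
from typing import Callable, Sequence

QAJudge = Callable[[str, str, str], bool]      # (question, reference, candidate) -> correct?
ArenaJudge = Callable[[str, str, str], float]  # (prompt, answer_a, answer_b) -> preference score


def poll_qa_verdict(judges: Sequence[QAJudge],
                    question: str, reference: str, candidate: str) -> bool:
    """Aggregate binary QA judgments from the panel by majority (max) voting."""
    votes = [judge(question, reference, candidate) for judge in judges]
    return Counter(votes).most_common(1)[0][0]


def poll_arena_score(judges: Sequence[ArenaJudge],
                     prompt: str, answer_a: str, answer_b: str) -> float:
    """Aggregate scalar preference scores from the panel by average pooling."""
    scores = [judge(prompt, answer_a, answer_b) for judge in judges]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy stand-in judges; in practice each would be an LLM call with a judging prompt.
    lenient: QAJudge = lambda q, ref, cand: ref.lower() in cand.lower()
    strict: QAJudge = lambda q, ref, cand: ref.lower() == cand.lower().strip(" .")
    verdict = poll_qa_verdict([lenient, strict, lenient],
                              "What is the capital of France?", "Paris", "It is Paris.")
    print("panel verdict:", verdict)  # True: 2 of 3 panelists accept the answer
```

With an odd-sized panel, majority voting avoids ties on binary judgments; for the pairwise Chatbot Arena setting, average pooling preserves the panel's graded preferences rather than collapsing them to a single winner.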