Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges

2 Jul 2024 | Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, Dieuwke Hupkes
This paper evaluates how well large language models (LLMs) perform as judges of responses produced by other LLMs. Using the TriviaQA benchmark, the study compares nine judge models against nine exam-taker models, covering both base and instruction-tuned versions, and measures how closely each judge's verdicts align with human judgments, as well as where the judges show biases and other weaknesses.

Among the key findings: although Llama-3 70B and GPT-4 Turbo align very well with humans, they are outperformed by JudgeLM-7B and by the simple lexical matching method "contains" when it comes to ranking the exam-taker models. Cohen's kappa proves to be a more informative alignment metric than simple percent agreement because it accounts for agreement expected by chance; even so, judges with high kappa scores can still exhibit systematic biases.

The analysis also shows that judge models struggle with ambiguous or under-specified answers and are sensitive to prompt length and specificity. They tend to be lenient, sometimes fail to credit responses that match the reference answer verbatim, and some are fooled by trivial responses such as "Yes" or "Sure."

The authors conclude that evaluating LLMs as judges requires multiple metrics and qualitative error analysis rather than reliance on a single alignment score, and that further research is needed to improve the reliability and consistency of LLM judges.
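To make two of these ideas concrete, the sketch below (not code from the paper; all data are invented toy examples) implements a "contains"-style lexical-matching judge and compares raw percent agreement with Cohen's kappa against hypothetical human labels, showing how kappa discounts the agreement expected by chance.

```python
# Minimal sketch, assuming binary correct/incorrect labels.
# (1) "contains" baseline: an answer is correct if any reference string
#     appears verbatim (case-insensitively) in the response.
# (2) Cohen's kappa vs. percent agreement: kappa subtracts the agreement
#     two independent annotators would reach by chance.
from collections import Counter


def contains_judge(response: str, references: list[str]) -> bool:
    """'contains' baseline: correct if any reference occurs in the response."""
    response_lower = response.lower()
    return any(ref.lower() in response_lower for ref in references)


def percent_agreement(a: list[bool], b: list[bool]) -> float:
    """Fraction of items on which two annotators give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: list[bool], b: list[bool]) -> float:
    """kappa = (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(a)
    p_o = percent_agreement(a, b)
    # Chance agreement from each annotator's marginal label rates.
    ca, cb = Counter(a), Counter(b)
    p_e = (ca[True] / n) * (cb[True] / n) + (ca[False] / n) * (cb[False] / n)
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0


if __name__ == "__main__":
    # Toy TriviaQA-style items: (exam-taker response, reference answers, human label).
    items = [
        ("The capital of France is Paris.", ["Paris"], True),
        ("I think it was London.", ["Paris"], False),
        ("It is known as the City of Light.", ["Paris"], True),   # correct, but no verbatim match
        ("Neil Armstrong walked on the Moon in 1969.", ["Neil Armstrong"], True),
        ("Sure", ["Neil Armstrong"], False),                       # trivially affirmative answer
    ]
    human = [label for _, _, label in items]
    judge = [contains_judge(resp, refs) for resp, refs, _ in items]

    print("percent agreement:", percent_agreement(human, judge))  # 0.80
    print("Cohen's kappa:    ", cohens_kappa(human, judge))       # ~0.62
```

On this toy data the judge agrees with the human on 4 of 5 items (80%), yet kappa is only about 0.62, because with these label rates a substantial share of that agreement would be expected by chance; this is the gap between the two metrics that the paper highlights.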