The paper "Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges" by Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, and Dieuwke Hupkes explores the use of large language models (LLMs) as judges to evaluate other LLMs. The authors focus on a controlled setup using the TriviaQA dataset, where human annotations serve as ground truth, allowing for a high inter-annotator agreement of 96%. They evaluate nine judge models and nine exam-taker models, both base and instruction-tuned, to assess their alignment with human judgments.
Key findings include:
- Only the best models, such as GPT-4 Turbo and Llama-3 70B, show excellent alignment with humans, and even these fall short of human inter-annotator agreement.
- Cohen’s kappa is a more robust measure of alignment than percent agreement, because it corrects for agreement expected by chance (a minimal worked sketch follows this list).
- Even with high alignment, judges like GPT-4 Turbo and Llama-3 70B still struggle with under-specified answers and tend to be lenient, affecting their consistency.
- Error analysis reveals that better-aligned models have higher recall rates but lower precision.
- Judge models are sensitive to prompt length and specificity, with larger models being more consistent across different prompts.
- Some judge models can be easily fooled by dummy answers, such as "Yes" and "Sure" (see the probe sketch below).
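To make the chance-correction point concrete, here is a minimal Python sketch (not taken from the paper; the labels are made up) that computes percent agreement and Cohen's kappa on binary correct/incorrect verdicts. It illustrates how a lenient judge can reach high percent agreement on a skewed dataset while kappa shows no better-than-chance alignment.

```python
from collections import Counter

def percent_agreement(human, judge):
    """Fraction of items where the judge's verdict matches the human label."""
    return sum(h == j for h, j in zip(human, judge)) / len(human)

def cohens_kappa(human, judge):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), i.e. observed agreement corrected
    for the agreement expected from each rater's marginal label frequencies."""
    n = len(human)
    p_o = percent_agreement(human, judge)
    h_counts, j_counts = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    p_e = sum((h_counts[l] / n) * (j_counts[l] / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical verdicts: a lenient judge marks every answer as correct (1).
human = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
judge = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print(percent_agreement(human, judge))  # 0.80 -- looks respectable
print(cohens_kappa(human, judge))       # 0.0  -- no alignment beyond chance
```

The same numbers could be obtained with `sklearn.metrics.cohen_kappa_score`; the manual version is shown only to expose the chance-agreement term.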
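The dummy-answer vulnerability can be checked with a simple probe. The sketch below is illustrative, not the authors' protocol: it assumes a hypothetical `judge_fn(question, reference, candidate)` wrapper around whatever judge model is under test, and dataset items with `"question"` and `"reference"` fields.

```python
def dummy_answer_fool_rate(judge_fn, dataset, dummies=("Yes", "Sure")):
    """Fraction of (item, dummy) pairs where the judge accepts a contentless answer.

    judge_fn(question, reference, candidate) -> bool is a hypothetical wrapper
    around the judge model; any True returned here means the judge was fooled,
    since the dummy strings carry no actual answer.
    """
    fooled, total = 0, 0
    for item in dataset:
        for dummy in dummies:
            total += 1
            if judge_fn(item["question"], item["reference"], dummy):
                fooled += 1
    return fooled / total if total else 0.0
```

A judge that accepts "Yes" or "Sure" at a non-trivial rate under this probe is rewarding the form of an answer rather than its content.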
The study highlights the strengths and weaknesses of using LLMs as judges, emphasizing the need for caution and further research to understand their limitations in various scenarios. The authors recommend using both percent agreement and Cohen’s kappa to ensure more reliable evaluations.