2 May 2024 | Rickard Stureborg¹², Dimitris Alikaniotis¹, Yoshi Suhara³,*
Large Language Models (LLMs) are biased and inconsistent evaluators for text summarization. This study analyzes the performance of LLMs on the SummEval and RoSE datasets, revealing several issues. LLMs exhibit familiarity bias, preferring texts with lower perplexity. They also show skewed rating distributions and anchoring effects in multi-attribute judgments. LLMs are inconsistent, showing low inter-sample agreement and sensitivity to prompt differences. The study proposes recipes to mitigate these issues, leading to improved performance on the RoSE dataset compared to state-of-the-art methods.
The study finds that LLMs exhibit familiarity bias, favoring texts they are more familiar with: summaries that receive high ratings tend to have lower perplexity. LLMs also show score bias, with rating distributions skewed toward particular round numbers. Anchoring effects appear in multi-attribute judgments, where the score assigned to one attribute influences the scores given to subsequent attributes. Finally, LLMs are inconsistent evaluators, showing low agreement across repeated samples for the same input and sensitivity to small prompt variations.
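The familiarity-bias finding rests on comparing each summary's perplexity under a language model with the score the LLM evaluator assigns it. Below is a minimal sketch of that comparison; the GPT-2 proxy model, the placeholder `summaries`/`llm_scores` data, and Spearman correlation as the association measure are illustrative assumptions, not the paper's exact setup.

```python
# Sketch: test whether LLM-assigned scores track (inverse) perplexity.
# GPT-2 as the perplexity proxy and Spearman's rho are assumptions for
# illustration; the paper's exact models and statistics may differ.
import torch
from scipy.stats import spearmanr
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under the proxy LM (exp of mean token NLL)."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(enc.input_ids, labels=enc.input_ids)
    return torch.exp(out.loss).item()

# Placeholder data: candidate summaries and the scores an LLM evaluator gave them.
summaries = ["summary text 1 ...", "summary text 2 ...", "summary text 3 ...", "summary text 4 ..."]
llm_scores = [88, 45, 70, 30]

ppls = [perplexity(s) for s in summaries]
rho, p = spearmanr(ppls, llm_scores)
print(f"Spearman rho between perplexity and LLM score: {rho:.3f} (p={p:.3f})")
# A strongly negative rho would be consistent with familiarity bias:
# lower-perplexity ("more familiar") summaries receiving higher scores.
```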
The study proposes methods to improve LLM evaluator performance, including increasing scoring granularity and using specific prompt configurations. Experiments show that these methods significantly improve performance on the RoSE dataset. The study also finds that LLMs are sensitive to temperature settings and chain-of-thought (CoT) prompting: higher temperatures improve performance with CoT prompts but reduce it for non-CoT prompts.
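As a concrete illustration of finer-grained scoring, the sketch below asks an evaluator model for a 0-100 score instead of a coarse 1-5 rating. The prompt wording, the 0-100 range, the model name, and the use of the OpenAI chat API are assumptions for illustration, not the paper's exact recipe; scoring one attribute per call is shown here as one hedge against the anchoring effect observed in multi-attribute prompts.

```python
# Sketch: an LLM-evaluator call with finer-grained (0-100) scoring.
# Prompt wording, scale, model name, and temperature are illustrative
# assumptions, not the paper's exact configuration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def score_consistency(source: str, summary: str, temperature: float = 0.0) -> str:
    # One attribute per call, so earlier scores cannot anchor later ones.
    prompt = (
        "You will be given a source document and a candidate summary.\n"
        "Rate the CONSISTENCY of the summary with the source on a scale "
        "from 0 (completely unfaithful) to 100 (perfectly faithful). "
        "Respond with a single integer.\n\n"
        f"Source document:\n{source}\n\nSummary:\n{summary}\n\nScore:"
    )
    resp = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return resp.choices[0].message.content.strip()
```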
LLMs are also sensitive to the source document: even performance on fluency, an attribute that should not depend on the source, drops when the source document is removed from the prompt. The study evaluates its proposed configuration on the RoSE dataset and finds that it outperforms existing methods on the CNNDM and SAMSum partitions. The authors conclude that LLM evaluators carry significant biases and inconsistencies that must be accounted for, note that the analysis relies primarily on GPT-based models, and call for further research covering other LLMs.
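Evaluator quality on benchmarks such as RoSE is typically reported as correlation between the automatic scores and human judgments. A minimal sketch of that comparison follows; Kendall's tau as the metric, the per-summary granularity, and the placeholder numbers are assumptions for illustration rather than the paper's exact protocol.

```python
# Sketch: meta-evaluating an LLM evaluator by correlating its scores with
# human judgments (e.g., RoSE's ACU-based scores). Kendall's tau and the
# placeholder numbers are illustrative assumptions.
from scipy.stats import kendalltau

# Hypothetical per-summary scores for one partition (e.g., CNNDM).
human_scores = [0.62, 0.41, 0.77, 0.30, 0.55]   # human judgment scores
llm_scores   = [81,   47,   90,   35,   60]     # LLM evaluator scores

tau, p_value = kendalltau(human_scores, llm_scores)
print(f"Kendall tau = {tau:.3f} (p = {p_value:.3f})")
# Higher tau means the evaluator ranks summaries more like humans do;
# this is the kind of figure compared against prior methods on RoSE.
```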