2 May 2024 | Rickard Stureborg¹², Dimitris Alikaniotis¹, Yoshi Suhara³,*
Large Language Models (LLMs) are biased and inconsistent evaluators for text summarization. This study analyzes the performance of LLMs on the SummEval and RoSE datasets, revealing several issues. LLMs exhibit familiarity bias, preferring texts with lower perplexity. They also show skewed rating distributions and anchoring effects in multi-attribute judgments. LLMs are inconsistent, showing low inter-sample agreement and sensitivity to prompt differences. The study proposes recipes to mitigate these issues, leading to improved performance on the RoSE dataset compared to state-of-the-art methods.
The study finds that LLMs exhibit familiarity bias, favoring texts they are more familiar with: summaries that receive high ratings tend to have lower perplexity. LLMs also show score bias, with rating distributions skewed toward particular round numbers. Anchoring effects appear in multi-attribute judgments, where the score assigned to one attribute influences the scores given to subsequent attributes. Finally, LLMs are inconsistent evaluators, showing low agreement across repeated samples for the same input and sensitivity to small prompt variations.
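The familiarity-bias finding rests on comparing each summary's perplexity under a language model with the score the LLM evaluator assigns it. Below is a minimal sketch of that comparison; the GPT-2 proxy model, the placeholder `summaries`/`llm_scores` data, and Spearman correlation as the association measure are illustrative assumptions, not the paper's exact setup.

```python
# Sketch: test whether LLM-assigned scores track (inverse) perplexity.
# GPT-2 as the perplexity proxy and Spearman's rho are assumptions for
# illustration; the paper's exact models and statistics may differ.
import torch
from scipy.stats import spearmanr
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under the proxy LM (exp of mean token NLL)."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(enc.input_ids, labels=enc.input_ids)
    return torch.exp(out.loss).item()

# Placeholder data: candidate summaries and the scores an LLM evaluator gave them.
summaries = ["summary text 1 ...", "summary text 2 ...", "summary text 3 ...", "summary text 4 ..."]
llm_scores = [88, 45, 70, 30]

ppls = [perplexity(s) for s in summaries]
rho, p = spearmanr(ppls, llm_scores)
print(f"Spearman rho between perplexity and LLM score: {rho:.3f} (p={p:.3f})")
# A strongly negative rho would be consistent with familiarity bias:
# lower-perplexity ("more familiar") summaries receiving higher scores.
```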
The study proposes methods to improve LLM evaluator performance, including increasing scoring granularity and using specific prompt configurations. Experiments show that these methods significantly improve performance on the RoSE dataset. The study also finds that LLMs are sensitive to temperature settings and chain-of-thought (CoT) prompting: higher temperatures improve performance with CoT prompts but reduce it for non-CoT prompts.
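As a concrete illustration of finer-grained scoring, the sketch below asks an evaluator model for a 0-100 score instead of a coarse 1-5 rating. The prompt wording, the 0-100 range, the model name, and the use of the OpenAI chat API are assumptions for illustration, not the paper's exact recipe; scoring one attribute per call is shown here as one hedge against the anchoring effect observed in multi-attribute prompts.

```python
# Sketch: an LLM-evaluator call with finer-grained (0-100) scoring.
# Prompt wording, scale, model name, and temperature are illustrative
# assumptions, not the paper's exact configuration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def score_consistency(source: str, summary: str, temperature: float = 0.0) -> str:
    # One attribute per call, so earlier scores cannot anchor later ones.
    prompt = (
        "You will be given a source document and a candidate summary.\n"
        "Rate the CONSISTENCY of the summary with the source on a scale "
        "from 0 (completely unfaithful) to 100 (perfectly faithful). "
        "Respond with a single integer.\n\n"
        f"Source document:\n{source}\n\nSummary:\n{summary}\n\nScore:"
    )
    resp = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return resp.choices[0].message.content.strip()
```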
LLMs are also sensitive to the source document: even performance on fluency, an attribute that should not depend on the source, drops when the source document is removed from the prompt. The study evaluates its proposed configuration on the RoSE dataset and finds that it outperforms existing methods on the CNNDM and SAMSum partitions. The authors conclude that LLM evaluators carry significant biases and inconsistencies that must be accounted for, note that the analysis relies primarily on GPT-based models, and call for further research covering other LLMs.
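Evaluator quality on benchmarks such as RoSE is typically reported as correlation between the automatic scores and human judgments. A minimal sketch of that comparison follows; Kendall's tau as the metric, the per-summary granularity, and the placeholder numbers are assumptions for illustration rather than the paper's exact protocol.

```python
# Sketch: meta-evaluating an LLM evaluator by correlating its scores with
# human judgments (e.g., RoSE's ACU-based scores). Kendall's tau and the
# placeholder numbers are illustrative assumptions.
from scipy.stats import kendalltau

# Hypothetical per-summary scores for one partition (e.g., CNNDM).
human_scores = [0.62, 0.41, 0.77, 0.30, 0.55]   # human judgment scores
llm_scores   = [81,   47,   90,   35,   60]     # LLM evaluator scores

tau, p_value = kendalltau(human_scores, llm_scores)
print(f"Kendall tau = {tau:.3f} (p = {p_value:.3f})")
# Higher tau means the evaluator ranks summaries more like humans do;
# this is the kind of figure compared against prior methods on RoSE.
```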