28 Jun 2024 | Xinyu Hu*,1, Mingqi Gao*,1, Sen Hu2, Yang Zhang2, Yicheng Chen2, Teng Xu2, Xiaojun Wan1
The paper investigates the reliability of large language models (LLMs) in evaluating natural language generation (NLG) tasks, focusing in particular on their ability to distinguish between different aspects of quality. The authors identify two main issues in existing NLG quality criteria: inconsistent conceptualization and ambiguous expression, both of which can lead to confusion in LLM-based evaluations. To address these issues, they propose a clear hierarchical classification system covering 11 common aspects and design 18 aspect-targeted perturbation attacks to analyze LLMs' evaluation behaviors at a fine-grained level. Human annotations are also conducted to validate the impact of these perturbations. The experimental results reveal significant confusion issues in LLM-based evaluations, even for powerful models such as GPT-4, highlighting the need for further research and improvements in LLM-based NLG evaluation. The paper contributes to the understanding of LLMs' capabilities and limitations in NLG evaluation, providing insights into the reliability of LLM-based approaches.
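To make the idea of an aspect-targeted perturbation concrete, here is a minimal sketch (not from the paper) of how such a probe could be set up: a perturbation that degrades one aspect (coherence, via sentence shuffling) while leaving another (fluency) nominally intact, and a check of whether an LLM evaluator's score on the untouched aspect shifts anyway. The function `score_aspect` is a hypothetical placeholder for whatever evaluator prompt and parsing one uses; the attack itself and the comparison logic are only illustrative assumptions about the general technique.

```python
import random


def shuffle_sentences(text: str, seed: int = 0) -> str:
    """Coherence-targeted perturbation: randomly reorder sentences.

    Each sentence's wording is untouched, so aspects such as fluency
    should, in principle, be unaffected by this attack.
    """
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    rng = random.Random(seed)
    rng.shuffle(sentences)
    return ". ".join(sentences) + "."


def score_aspect(text: str, aspect: str) -> float:
    """Hypothetical LLM-evaluator call (placeholder, not a real API).

    In practice this would prompt an evaluator model with an
    aspect-specific criterion and parse a numeric rating.
    """
    raise NotImplementedError("plug in an LLM evaluator here")


def confusion_probe(original: str, aspect_checked: str) -> float:
    """Score change on an aspect the perturbation should NOT affect.

    A large drop in, e.g., the fluency score after a coherence-only
    attack would suggest the evaluator conflates the two aspects.
    """
    perturbed = shuffle_sentences(original)
    return score_aspect(original, aspect_checked) - score_aspect(perturbed, aspect_checked)
```

With a real evaluator plugged into `score_aspect`, a nonzero result from `confusion_probe(text, "fluency")` after a coherence-only shuffle would be the kind of fine-grained confusion signal the paper measures across its 18 attacks.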