28 Jun 2024 | Xinyu Hu, Mingqi Gao, Sen Hu, Yang Zhang, Yicheng Chen, Teng Xu, Xiaojun Wan
This paper investigates whether large language models (LLMs) can accurately distinguish different aspects of natural language generation (NLG) quality, or whether they confuse different evaluation criteria. The authors first analyze existing NLG quality criteria and summarize them into a clear hierarchical classification system covering 11 common aspects. Inspired by behavioral testing, they then design 18 aspect-targeted perturbation attacks to enable fine-grained analysis of LLM evaluation behaviors, and they conduct human annotations to validate the impact of these perturbations. Their experiments with GPT-3.5, GPT-4, and Prometheus reveal that all of these models frequently confuse different aspects of NLG quality, producing inconsistent evaluation results. They also find that varying the level of detail in the criterion descriptions does not significantly affect LLM evaluation behaviors. The authors conclude that LLM-based evaluations are not reliable across different aspects of NLG quality and that further research is needed to improve their reliability.
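To make the idea of aspect-targeted perturbation attacks concrete, here is a minimal sketch in Python. It is not the authors' code: the paper's 18 perturbations, prompt templates, and scoring protocol are not reproduced here, and the function names and the prompt wording below are hypothetical. The sketch shows two illustrative perturbations (one degrading fluency, one degrading coherence) and how one might prompt an LLM judge on a single aspect to check whether its score reacts only to the targeted aspect.

```python
import random

# Hypothetical aspect-targeted perturbations in the spirit of behavioral testing.
# These are illustrative stand-ins, not the paper's actual 18 attacks.

def perturb_fluency(text: str, swap_prob: float = 0.3, seed: int = 0) -> str:
    """Degrade fluency by swapping adjacent characters inside random words."""
    rng = random.Random(seed)
    words = text.split()
    for i, w in enumerate(words):
        if len(w) > 3 and rng.random() < swap_prob:
            j = rng.randrange(len(w) - 1)
            words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)

def perturb_coherence(text: str, seed: int = 0) -> str:
    """Degrade coherence by shuffling sentence order while leaving each sentence intact."""
    rng = random.Random(seed)
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    rng.shuffle(sentences)
    return ". ".join(sentences) + "."

def build_eval_prompt(output: str, aspect: str, description: str) -> str:
    """Hypothetical single-aspect evaluation prompt for an LLM judge."""
    return (
        f"Rate the following text on a 1-5 scale for {aspect} only.\n"
        f"Criterion: {description}\n\n"
        f"Text: {output}\n\nScore:"
    )

if __name__ == "__main__":
    original = ("The committee approved the budget. It will fund three new schools. "
                "Construction starts in May.")
    attacked = perturb_coherence(original)
    # If the judge were aspect-faithful, a coherence-targeted perturbation should
    # lower its coherence score while leaving, e.g., its fluency score unchanged.
    print(build_eval_prompt(attacked, "fluency",
                            "The text is grammatical and reads naturally."))
```

In this setup, confusion of criteria would show up as the fluency score dropping after a coherence-only perturbation (or vice versa), which is the kind of fine-grained behavior the paper probes across its 11 aspects.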