G-EVAL: NLG Evaluation using GPT-4 with Better Human Alignment

23 May 2023 | Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, Chenguang Zhu
G-EVAL is a framework that uses large language models (LLMs) with chain-of-thought (CoT) prompting and a form-filling paradigm to evaluate the quality of natural language generation (NLG) outputs. The LLM is first prompted to generate detailed evaluation steps for a given criterion, and those steps are then used to assess the quality of generated texts. G-EVAL was tested on two NLG tasks, text summarization and dialogue generation; on summarization, G-EVAL with GPT-4 as the backbone model achieved a Spearman correlation of 0.514 with human judgments, outperforming all previous methods.

The framework also includes a scoring function that weights the candidate scores by the probabilities of their output tokens, yielding more fine-grained, continuous final scores. In the experiments, G-EVAL outperforms existing evaluators in correlation with human judgments, larger backbone models such as GPT-4 perform better on summarization, CoT evaluation steps improve LLM-based evaluators by providing additional context and guidance, and probability-weighted scoring produces more accurate and continuous scores.
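To make the form-filling and CoT setup concrete, here is an illustrative Python sketch of what a G-EVAL-style evaluation prompt could look like. The structure (task introduction and criteria, auto-generated evaluation steps, the input texts, and a single score field to fill in) follows the paper's description, but the exact wording below is an assumption, not the paper's verbatim prompt.

```python
# Illustrative sketch of a G-EVAL-style form-filling prompt.
# The wording is an assumption based on the paper's description, not the
# original prompt; only the overall structure is taken from the paper.

COHERENCE_PROMPT = """\
You will be given one summary written for a news article.

Your task is to rate the summary on one metric: Coherence (1-5),
the collective quality of all sentences in the summary.

Evaluation Steps:
{evaluation_steps}

Source Text:
{document}

Summary:
{summary}

Evaluation Form (score ONLY):
- Coherence:"""

# In G-EVAL, the evaluation steps are themselves generated by the LLM (the
# chain-of-thought) from the task introduction and criteria, then reused for
# every input. The step text and inputs below are placeholders.
prompt = COHERENCE_PROMPT.format(
    evaluation_steps="1. Read the article. 2. Check sentence flow. 3. Assign a score.",
    document="(news article text)",
    summary="(candidate summary)",
)
print(prompt)
```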
The paper also discusses the limitations of traditional reference-based metrics, which often correlate poorly with human judgments, especially on open-ended and creative tasks, and it highlights potential issues with LLM-based evaluators, such as a bias toward LLM-generated texts that could lead to self-reinforcement if such evaluators are used as reward signals. The study provides a comprehensive analysis of the behavior of LLM-based evaluators and their potential issues. Overall, G-EVAL represents a significant advancement in NLG evaluation, offering a more effective and reliable framework for assessing the quality of generated texts.
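The probability-weighted scoring mentioned above computes the final metric as the expected score over the candidate score tokens, roughly score = Σ p(sᵢ) · sᵢ. Below is a minimal sketch of this idea, assuming the evaluator LLM exposes per-token log probabilities for the candidate scores (1-5); the helper function and the example numbers are illustrative, not from the paper.

```python
# Minimal sketch of probability-weighted scoring: weight each candidate score
# by the probability the evaluator LLM assigned to its token, then sum.
# The log-probability values below are made-up placeholders.

import math

def probability_weighted_score(score_logprobs: dict[int, float]) -> float:
    """Return sum_i p(s_i) * s_i over the candidate integer scores s_i.

    score_logprobs maps each candidate score (e.g. 1-5) to the log-probability
    the LLM assigned to that score token.
    """
    probs = {s: math.exp(lp) for s, lp in score_logprobs.items()}
    total = sum(probs.values())  # renormalize over the candidate scores only
    return sum(s * (p / total) for s, p in probs.items())

# Example: most probability mass on 4, some on 3 and 5.
logprobs = {1: -9.2, 2: -7.1, 3: -1.6, 4: -0.4, 5: -2.3}
print(round(probability_weighted_score(logprobs), 3))  # continuous score near 3.9
```

Because the result is a probability-weighted average rather than a single sampled integer, small differences between outputs produce different scores, which is what gives the metric its finer granularity.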