23 May 2023 | Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, Chenguang Zhu
The paper introduces G-EVAL, a framework that uses large language models (LLMs) with chain-of-thoughts (CoT) and a form-filling paradigm to evaluate the quality of natural language generation (NLG) outputs. G-EVAL aims to address the limitations of traditional reference-based metrics such as BLEU and ROUGE, which correlate poorly with human judgments, especially on tasks that require creativity and diversity. The framework is evaluated on two tasks: text summarization and dialogue generation. Using GPT-4 as the backbone model, G-EVAL achieves a Spearman correlation of 0.514 with human judgments on the summarization task, outperforming previous methods. The paper also analyzes the behavior of LLM-based evaluators, highlighting a potential bias towards LLM-generated texts. The main contributions are: showing that LLM-based metrics correlate with human judgments better than prior reference-based and reference-free baselines; showing that CoT-generated evaluation steps improve LLM-based evaluators; and showing that weighting scores by output token probabilities yields more fine-grained, continuous scores. The paper concludes with a discussion of the potential risks and challenges of using LLMs as evaluators.
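
To make the scoring mechanism concrete, below is a minimal Python sketch of G-EVAL-style evaluation, not the authors' implementation: a form-filling prompt that combines the evaluation criterion, CoT evaluation steps, and the text to judge, followed by a probability-weighted sum over the candidate score tokens, which is how the paper derives fine-grained continuous scores. The `get_score_token_probs` helper is a hypothetical stand-in for whatever LLM API you use to read out token probabilities, and the prompt wording is illustrative rather than the paper's exact template.

```python
# Hypothetical stub: replace with a call to an LLM API that exposes per-token
# probabilities (logprobs) for the generated score token.
def get_score_token_probs(prompt: str, candidates: list[str]) -> dict[str, float]:
    raise NotImplementedError("Wire this to an LLM API that returns token logprobs.")


def build_geval_prompt(criterion: str, steps: str, source: str, summary: str) -> str:
    """Form-filling prompt: task introduction, evaluation criterion, CoT evaluation
    steps, the inputs to judge, and a form the model fills with a score."""
    return (
        "You will be given one summary written for a news article.\n"
        f"Your task is to rate the summary on one metric: {criterion} (1-5).\n\n"
        f"Evaluation Steps:\n{steps}\n\n"
        f"Source Text:\n{source}\n\n"
        f"Summary:\n{summary}\n\n"
        f"Evaluation Form (scores ONLY):\n- {criterion}:"
    )


def geval_score(prompt: str, scale=(1, 2, 3, 4, 5)) -> float:
    """Continuous score: a probability-weighted sum over the rating scale,
    score = sum_i p(s_i) * s_i, rather than a single sampled integer."""
    probs = get_score_token_probs(prompt, [str(s) for s in scale])
    total = sum(probs.values()) or 1.0  # normalize over the candidate score tokens
    return sum(p / total * int(tok) for tok, p in probs.items())
```

In the paper, the evaluation steps themselves are generated by the LLM from the task introduction and criterion (the auto chain-of-thought step), so the `steps` argument would typically come from an earlier LLM call rather than being hand-written.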