LLM-based NLG Evaluation: Current Status and Challenges

26 Feb 2024 | Mingqi Gao*, Xinyu Hu*, Jie Ruan, Xiao Pu, Xiaojun Wan
The paper "LLM-based NLG Evaluation: Current Status and Challenges" by Mingqi Gao, Xinyu Hu, Jie Ruan, Xiao Pu, and Xiaojun Wan from Peking University reviews the state of NLG evaluation using large language models (LLMs). Traditional evaluation metrics like BLEU and ROUGE have been criticized for their low correlation with human judgments, leading to the development of LLM-based methods. The paper categorizes these methods into four types: LLM-derived metrics, prompting LLMs, fine-tuning LLMs, and human-LLM collaborative evaluation. 1. **LLM-derived Metrics**: These methods derive evaluation metrics from LLMs, either through embedding-based or probability-based approaches. While these methods offer better performance and correlation with human judgments, they lack robustness, efficiency, and fairness. 2. **Prompting LLMs**: This approach involves using LLMs to evaluate texts based on prompts that include task instructions, evaluation criteria, input content, and evaluation methods. LLMs can generate explanations, making the evaluation process more interpretable. However, position bias and preferences for longer responses are among the challenges. 3. **Fine-tuning LLMs**: This method involves fine-tuning smaller, open-source LLMs specifically for evaluation tasks. It aims to achieve performance close to GPT-4 while being more reproducible and cost-effective. Despite improvements, biases and computational expenses remain issues. 4. **Human-LLM Collaborative Evaluation**: This approach leverages the strengths of both human evaluators and LLMs to achieve robust and nuanced evaluations. It includes methods like scoring, explaining, testing, and auditing, with ongoing research focusing on improving efficiency and reliability. The paper concludes by discussing future research directions, including the need for unified benchmarks, evaluation in low-resource languages, and diverse forms of human-LLM collaboration.The paper "LLM-based NLG Evaluation: Current Status and Challenges" by Mingqi Gao, Xinyu Hu, Jie Ruan, Xiao Pu, and Xiaojun Wan from Peking University reviews the state of NLG evaluation using large language models (LLMs). Traditional evaluation metrics like BLEU and ROUGE have been criticized for their low correlation with human judgments, leading to the development of LLM-based methods. The paper categorizes these methods into four types: LLM-derived metrics, prompting LLMs, fine-tuning LLMs, and human-LLM collaborative evaluation. 1. **LLM-derived Metrics**: These methods derive evaluation metrics from LLMs, either through embedding-based or probability-based approaches. While these methods offer better performance and correlation with human judgments, they lack robustness, efficiency, and fairness. 2. **Prompting LLMs**: This approach involves using LLMs to evaluate texts based on prompts that include task instructions, evaluation criteria, input content, and evaluation methods. LLMs can generate explanations, making the evaluation process more interpretable. However, position bias and preferences for longer responses are among the challenges. 3. **Fine-tuning LLMs**: This method involves fine-tuning smaller, open-source LLMs specifically for evaluation tasks. It aims to achieve performance close to GPT-4 while being more reproducible and cost-effective. Despite improvements, biases and computational expenses remain issues. 4. **Human-LLM Collaborative Evaluation**: This approach leverages the strengths of both human evaluators and LLMs to achieve robust and nuanced evaluations. 
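The prompting-based approach (item 2 above) can likewise be sketched as assembling a prompt from task instructions, an evaluation criterion, the input content, and the scoring method, and then querying a chat model. The model name, the 1-5 coherence rubric, and the client usage below are illustrative assumptions, not the paper's setup.

```python
# Minimal sketch of prompt-based evaluation: the prompt bundles task
# instructions, a criterion, the input content, and the scoring method.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT_TEMPLATE = """You will be given a news article and a summary of it.
Your task is to rate the summary on one criterion.

Evaluation criterion:
Coherence (1-5): the summary should be well-structured and well-organized.

Article:
{article}

Summary:
{summary}

Evaluation method:
Read the article, compare it with the summary, and output only an integer
score from 1 to 5."""

def llm_evaluate(article: str, summary: str, model: str = "gpt-4o-mini") -> str:
    prompt = PROMPT_TEMPLATE.format(article=article, summary=summary)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic scoring reduces run-to-run variance
    )
    return response.choices[0].message.content

print(llm_evaluate("Officials announced a new park will open downtown next spring.",
                   "A new downtown park opens next spring."))
```

In practice such prompts are often extended to ask for an explanation alongside the score, or to compare two candidate outputs side by side, which is where the position bias noted above becomes a concern.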