31 Mar 2024 | Liyan Tang, Igor Shalyminov, Amy Wing-mei Wong, Jon Burnsky, Jake W. Vincent, Yu'an Yang, Siffi Singh, Song Feng, Hwanjun Song, Hang Su, Lijia Sun, Yi Zhang, Saab Mansour, Kathleen McKeown
The paper introduces TofuEval, a new benchmark for evaluating factual consistency in topic-focused dialogue summarization. The benchmark consists of summaries generated by various LLMs, with sentence-level human annotations of factual consistency. The study finds that LLMs produce a substantial number of factual errors in dialogue summaries, regardless of model size. When used as evaluators, LLMs perform poorly compared to non-LLM-based factuality metrics, which are better at capturing the full range of error types; the gap is especially pronounced on main-topic summaries. The analysis further suggests that LLMs struggle to detect errors in dialogue summaries because of the complexity of the task. These results highlight the challenges of evaluating factual consistency in dialogue summarization and the need for more effective evaluation metrics. The paper concludes that while LLMs have potential as evaluators, non-LLM-based metrics are currently more effective at detecting factual inconsistencies, and the benchmark provides a valuable resource for further research into automated evaluation of summary factuality.
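To make the sentence-level evaluation setup concrete, here is a minimal sketch of how an LLM might be prompted to give a binary factual-consistency judgment for each summary sentence against the source dialogue. The prompt wording and the `call_llm` helper are illustrative placeholders, not TofuEval's actual annotation or evaluation protocol.

```python
from typing import Callable, List

def build_prompt(dialogue: str, sentence: str) -> str:
    """Hypothetical prompt asking for a yes/no factual-consistency judgment."""
    return (
        "Dialogue:\n"
        f"{dialogue}\n\n"
        "Summary sentence:\n"
        f"{sentence}\n\n"
        "Is the summary sentence factually consistent with the dialogue? "
        "Answer 'yes' or 'no'."
    )

def judge_summary(
    dialogue: str,
    summary_sentences: List[str],
    call_llm: Callable[[str], str],
) -> List[bool]:
    """Label each summary sentence True (consistent) or False (inconsistent).

    `call_llm` stands in for any text-generation client; the paper's
    actual prompts and decision rules differ from this sketch.
    """
    labels = []
    for sentence in summary_sentences:
        answer = call_llm(build_prompt(dialogue, sentence))
        labels.append(answer.strip().lower().startswith("yes"))
    return labels
```

Scoring judgments per sentence rather than per summary is what lets a benchmark like this localize errors and compare evaluators on exactly which sentences they flag.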