31 Mar 2024 | Liyan Tang, Igor Shalyminov, Amy Wing-mei Wong, Jon Burnsky, Jake W. Vincent, Yu'an Yang, Siffi Singh, Song Feng, Hwanjun Song, Hang Su, Lijia Sun, Yi Zhang, Saab Mansour, Kathleen McKeown
The paper introduces a new evaluation benchmark, TOFUEVAL, for topic-focused dialogue summarization, focusing on the factual consistency of LLM-generated summaries. The benchmark includes 1,500 summaries from five LLMs of varying sizes, generated from documents sampled from two dialogue summarization datasets: MediaSum and MeetingBank. Human annotators provide binary factuality labels and explanations for each sentence in the summaries, revealing that LLMs make substantial factual errors regardless of model size. The study also finds that LLMs, including GPT-4, perform poorly as binary factuality evaluators and are outperformed by state-of-the-art non-LLM-based factuality metrics. An error analysis using a curated taxonomy shows that non-LLM metrics capture all error types better than LLM-based evaluators. The paper concludes with insights into the limitations of current approaches and future directions for improving automated evaluation of summary factuality.
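As a rough illustration of how a sentence-level factuality benchmark of this kind is typically used (this is not the paper's actual evaluation code), the sketch below scores a hypothetical automatic factuality metric against human binary labels using balanced accuracy. The `SummarySentence` type, the 0.5 threshold, and the toy annotations are all assumptions made here for illustration.

```python
# Hypothetical sketch: scoring an automatic factuality evaluator against
# human binary labels, in the spirit of a sentence-level benchmark like
# TOFUEVAL. The data, fields, and threshold are illustrative only.

from dataclasses import dataclass


@dataclass
class SummarySentence:
    text: str
    human_label: bool      # True = judged factually consistent with the source dialogue
    metric_score: float    # score from some automatic factuality metric, in [0, 1]


def binarize(score: float, threshold: float = 0.5) -> bool:
    """Turn a continuous metric score into a binary consistency prediction."""
    return score >= threshold


def balanced_accuracy(sentences: list[SummarySentence], threshold: float = 0.5) -> float:
    """Balanced accuracy over consistent / inconsistent sentences,
    a common choice when factual errors are the minority class."""
    tp = fn = tn = fp = 0
    for s in sentences:
        pred = binarize(s.metric_score, threshold)
        if s.human_label and pred:
            tp += 1
        elif s.human_label and not pred:
            fn += 1
        elif not s.human_label and not pred:
            tn += 1
        else:
            fp += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return (sensitivity + specificity) / 2


if __name__ == "__main__":
    # Toy, made-up annotations for three summary sentences.
    sample = [
        SummarySentence("The mayor approved the budget.", True, 0.91),
        SummarySentence("The vote took place in March.", False, 0.62),
        SummarySentence("Two council members abstained.", True, 0.34),
    ]
    print(f"Balanced accuracy: {balanced_accuracy(sample):.2f}")
```

Balanced accuracy is shown here simply because it treats the (usually rarer) inconsistent class on equal footing with the consistent class; the paper's own metric choices may differ.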