22 Jul 2024 | Hwanjun Song, Hang Su, Igor Shalyminov, Jason Cai, Saab Mansour
**FineSurE: Fine-grained Summarization Evaluation using LLMs**
**Authors:** Hwanjun Song, Hang Su, Igor Shalyminov, Jason Cai, Saab Mansour
**Institution:** Korea Advanced Institute of Science and Technology (KAIST) and AWS AI Labs
**Abstract:**
Automated evaluation is crucial for streamlining text summarization benchmarking and model development, given the costly and time-consuming nature of human evaluation. Traditional methods like ROUGE do not correlate well with human judgment, while recent LLM-based metrics provide only summary-level assessment using Likert-scale scores. This limits deeper model analysis. To address these limitations, we propose FineSurE, a fine-grained evaluator specifically tailored for the summarization task using large language models (LLMs). It employs completeness and conciseness criteria, in addition to faithfulness, enabling multi-dimensional assessment. We compare various open-source and proprietary LLMs as backbones for FineSurE and conduct extensive benchmarking against state-of-the-art methods, showing improved performance, especially on completeness and conciseness dimensions.
**Key Contributions:**
1. We argue that LLM-based summarization suffers from hallucination, information omission, and verbosity, requiring a revisit of evaluation dimensions.
2. We suggest three metrics targeting LLM output characteristics: faithfulness, completeness, and conciseness.
3. We propose FineSurE, a novel automated evaluation framework based on keyfact lists, which uses LLMs to generate keyfacts, align them with summary sentences, and categorize errors automatically (a scoring sketch follows this list).
4. We compare various open-source and proprietary LLMs to power FineSurE and analyze their correlation with human judgment.
5. We provide comprehensive results comparing FineSurE with similarity-based, NLI-based, QA-based, and LLM-based automated methods, showing improved human correlation over state-of-the-art methods.
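To make contributions 2 and 3 concrete: FineSurE labels each summary sentence with an error category (fact checking) and maps each keyfact to the summary sentences that cover it (keyfact alignment), then turns those labels into three percentage scores. Below is a minimal sketch of that scoring step; the function names, label strings, and data layout are illustrative assumptions, not the paper's reference implementation.

```python
# Minimal sketch of FineSurE-style scoring from LLM outputs (assumed data layout).
# sentence_labels: per summary sentence, an error category from the fact-checking
#   pass ("no error" means the sentence is judged faithful).
# alignment: per keyfact, the list of summary-sentence indices (1-based) that the
#   keyfact-alignment pass says support it (empty list = keyfact not covered).
from typing import Dict, List


def faithfulness(sentence_labels: List[str]) -> float:
    """Fraction of summary sentences judged free of factuality errors."""
    return sum(label == "no error" for label in sentence_labels) / len(sentence_labels)


def completeness(alignment: Dict[str, List[int]]) -> float:
    """Fraction of keyfacts supported by at least one summary sentence."""
    return sum(bool(sentences) for sentences in alignment.values()) / len(alignment)


def conciseness(alignment: Dict[str, List[int]], num_sentences: int) -> float:
    """Fraction of summary sentences that support at least one keyfact."""
    aligned = {s for sentences in alignment.values() for s in sentences}
    return len(aligned) / num_sentences


# Toy example: a 3-sentence summary checked against 2 keyfacts.
labels = ["no error", "out-of-context error", "no error"]
align = {"keyfact-1": [1], "keyfact-2": []}

print(faithfulness(labels))             # 2/3 of sentences judged faithful
print(completeness(align))              # 1/2 of keyfacts covered
print(conciseness(align, len(labels)))  # 1/3 of sentences tied to a keyfact
```

Because the scores are computed per sentence and per keyfact rather than as a single Likert rating, the same outputs also support the fine-grained error analysis the paper argues for.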
**Evaluation:**
- **Datasets:** FRANK and REALSumm are used for evaluating the automated evaluator's performance.
- **LLMs as Evaluators:** GPT-4-turbo is the default LLM, but we test with various open-source and proprietary LLMs.
- **Baselines:** FineSurE is compared with similarity-based methods (ROUGE, BERTScore, BARTScore), NLI-based methods (SummaC-Conv), QA-based methods (UniEval, QAFactEval), and the latest LLM-based method (G-Eval).
- **Performance:** FineSurE outperforms existing evaluators on faithfulness, completeness, and conciseness at all levels of evaluation, demonstrating strong alignment with human judgments (a correlation-check sketch follows this list).
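Benchmarking an automated evaluator against human judgment typically reduces to correlating its scores with human ratings over the same summaries (and, at the system level, over per-model averages). The sketch below shows one common form of that check using SciPy's Spearman and Pearson correlations; the score values are placeholders and the paper's exact protocol may differ.

```python
# Hedged sketch: correlating an automated evaluator's scores with human ratings.
# The score values below are placeholders, not results from the paper.
from scipy.stats import pearsonr, spearmanr

human = [0.9, 0.4, 0.7, 0.2, 0.6]        # e.g., human faithfulness ratings per summary
automatic = [0.85, 0.5, 0.65, 0.1, 0.7]  # e.g., FineSurE faithfulness scores per summary

rho, rho_p = spearmanr(human, automatic)  # rank correlation (summary level)
r, r_p = pearsonr(human, automatic)       # linear correlation

print(f"Spearman rho = {rho:.3f} (p = {rho_p:.3f})")
print(f"Pearson  r   = {r:.3f} (p = {r_p:.3f})")
```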
**Conclusion:**
FineSurE is a novel automated evaluator designed for fine-grained and multi-dimensional text summarization evaluation. It provides detailed insights through fact checking and keyfact alignment, offering a promising approach to advancing automated evaluation in text summarization.