22 Jul 2024 | Hwanjun Song, Hang Su, Igor Shalyminov, Jason Cai, Saab Mansour
**FineSurE: Fine-grained Summarization Evaluation using LLMs**
**Authors:** Hwanjun Song, Hang Su, Igor Shalyminov, Jason Cai, Saab Mansour
**Institution:** Korea Advanced Institute of Science and Technology (KAIST) and AWS AI Labs
**Abstract:**
Automated evaluation is crucial for streamlining text summarization benchmarking and model development, given the costly and time-consuming nature of human evaluation. Traditional methods like ROUGE do not correlate well with human judgment, while recent LLM-based metrics provide only summary-level assessment using Likert-scale scores. This limits deeper model analysis. To address these limitations, we propose FineSurE, a fine-grained evaluator specifically tailored for the summarization task using large language models (LLMs). It employs completeness and conciseness criteria, in addition to faithfulness, enabling multi-dimensional assessment. We compare various open-source and proprietary LLMs as backbones for FineSurE and conduct extensive benchmarking against state-of-the-art methods, showing improved performance, especially on completeness and conciseness dimensions.
**Key Contributions:**
1. We argue that LLM-based summarization suffers from hallucination, information omission, and verbosity, requiring a revisit of evaluation dimensions.
2. We suggest three metrics targeting LLM output characteristics: faithfulness, completeness, and conciseness.
3. We propose FineSurE, a novel automated evaluation framework based on keyfact lists, which uses LLMs to generate keyfacts, align them with summary sentences, and categorize errors automatically (a scoring sketch follows this list).
4. We compare various open-source and proprietary LLMs to power FineSurE and analyze their correlation with human judgment.
5. We provide comprehensive results comparing FineSurE with similarity-based, NLI-based, QA-based, and LLM-based automated methods, showing improved human correlation over state-of-the-art methods.
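To make contributions 2 and 3 concrete: FineSurE labels each summary sentence with an error category (fact checking) and maps each keyfact to the summary sentences that cover it (keyfact alignment), then turns those labels into three percentage scores. Below is a minimal sketch of that scoring step; the function names, label strings, and data layout are illustrative assumptions, not the paper's reference implementation.

```python
# Minimal sketch of FineSurE-style scoring from LLM outputs (assumed data layout).
# sentence_labels: per summary sentence, an error category from the fact-checking
#   pass ("no error" means the sentence is judged faithful).
# alignment: per keyfact, the list of summary-sentence indices (1-based) that the
#   keyfact-alignment pass says support it (empty list = keyfact not covered).
from typing import Dict, List


def faithfulness(sentence_labels: List[str]) -> float:
    """Fraction of summary sentences judged free of factuality errors."""
    return sum(label == "no error" for label in sentence_labels) / len(sentence_labels)


def completeness(alignment: Dict[str, List[int]]) -> float:
    """Fraction of keyfacts supported by at least one summary sentence."""
    return sum(bool(sentences) for sentences in alignment.values()) / len(alignment)


def conciseness(alignment: Dict[str, List[int]], num_sentences: int) -> float:
    """Fraction of summary sentences that support at least one keyfact."""
    aligned = {s for sentences in alignment.values() for s in sentences}
    return len(aligned) / num_sentences


# Toy example: a 3-sentence summary checked against 2 keyfacts.
labels = ["no error", "out-of-context error", "no error"]
align = {"keyfact-1": [1], "keyfact-2": []}

print(faithfulness(labels))             # 2/3 of sentences judged faithful
print(completeness(align))              # 1/2 of keyfacts covered
print(conciseness(align, len(labels)))  # 1/3 of sentences tied to a keyfact
```

Because the scores are computed per sentence and per keyfact rather than as a single Likert rating, the same outputs also support the fine-grained error analysis the paper argues for.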
**Evaluation:**
- **Datasets:** FRANK and REALSumm are used for evaluating the automated evaluator's performance.
- **LLMs as Evaluators:** GPT-4-turbo is the default LLM, but we test with various open-source and proprietary LLMs.
- **Baselines:** FineSurE is compared with similarity-based methods (ROUGE, BERTScore, BARTScore), NLI-based methods (SummaC-Conv), QA-based methods (UniEval, QAFactEval), and the latest LLM-based method (G-Eval).
- **Performance:** FineSurE outperforms existing evaluators on faithfulness, completeness, and conciseness at all levels of evaluation, demonstrating strong alignment with human judgments (a correlation-check sketch follows this list).
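Benchmarking an automated evaluator against human judgment typically reduces to correlating its scores with human ratings over the same summaries (and, at the system level, over per-model averages). The sketch below shows one common form of that check using SciPy's Spearman and Pearson correlations; the score values are placeholders and the paper's exact protocol may differ.

```python
# Hedged sketch: correlating an automated evaluator's scores with human ratings.
# The score values below are placeholders, not results from the paper.
from scipy.stats import pearsonr, spearmanr

human = [0.9, 0.4, 0.7, 0.2, 0.6]        # e.g., human faithfulness ratings per summary
automatic = [0.85, 0.5, 0.65, 0.1, 0.7]  # e.g., FineSurE faithfulness scores per summary

rho, rho_p = spearmanr(human, automatic)  # rank correlation (summary level)
r, r_p = pearsonr(human, automatic)       # linear correlation

print(f"Spearman rho = {rho:.3f} (p = {rho_p:.3f})")
print(f"Pearson  r   = {r:.3f} (p = {r_p:.3f})")
```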
**Conclusion:**
FineSurE is a novel automated evaluator designed for fine-grained and multi-dimensional text summarization evaluation. It provides detailed insights through fact checking and keyfact alignment, offering a promising approach to advancing automated evaluation in text summarization.