On Faithfulness and Factuality in Abstractive Summarization

2 May 2020 | Joshua Maynez*, Shashi Narayan*, Bernd Bohnet, Ryan McDonald
This paper investigates hallucination in abstractive document summarization. It shows that current neural text generation models, while effective at producing fluent and coherent summaries, are prone to generating content that is unfaithful to the input document. To characterize the problem, the authors conducted a large-scale human evaluation of several abstractive summarization systems and annotated the kinds of hallucination each one produces.

The paper distinguishes between intrinsic and extrinsic hallucinations. Intrinsic hallucinations misrepresent information that is present in the input document, for example by combining facts from the source incorrectly, whereas extrinsic hallucinations add information that cannot be inferred from the input at all. The evaluation shows that although pretrained models score better on raw metrics such as ROUGE, they still generate a significant amount of hallucinated content. It also finds that textual entailment measures correlate with faithfulness better than standard metrics do, suggesting that automatic evaluation metrics, and possibly training criteria, could be improved by incorporating semantic inference.
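The paper reports correlations between entailment predictions and human faithfulness judgments rather than prescribing an implementation, but the idea is easy to sketch. The snippet below is a minimal illustration that assumes the Hugging Face transformers library and the publicly available roberta-large-mnli checkpoint (neither is specified by the paper); it scores a summary by the probability that the source document entails it.

# Minimal sketch of entailment-based faithfulness scoring (not the paper's exact setup).
# Assumptions: pip install torch transformers; the NLI checkpoint is illustrative.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"  # any NLI model could be substituted here
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def faithfulness_score(document: str, summary: str) -> float:
    """Probability that the document (premise) entails the summary (hypothesis)."""
    # Long documents are simply truncated here; a real system would chunk the
    # document and aggregate per-chunk entailment scores.
    inputs = tokenizer(document, summary, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]
    # For roberta-large-mnli the label order is 0=contradiction, 1=neutral,
    # 2=entailment; check model.config.id2label when using another checkpoint.
    return probs[2].item()

doc = "The company reported quarterly revenue of 4.2 billion dollars."
summary = "Quarterly revenue reached 4.2 billion dollars."
print(f"entailment probability: {faithfulness_score(doc, summary):.3f}")

A summary that contradicts or adds to the document should receive a low entailment probability, which is the property that makes such measures better aligned with human faithfulness judgments than surface-overlap metrics.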
The human study finds that over 90% of extrinsic hallucinations are erroneous: most hallucinated content is neither faithful nor factual. Models initialized with pretrained parameters nevertheless perform best, both on automatic metrics and on human judgments of faithfulness and factuality, and they produce the highest proportion of extrinsic hallucinations that are actually factual. The paper also examines factual hallucinations, i.e., generated content that is factually correct yet still unfaithful to the input document. Pretrained models generate factual content more often, plausibly because pretraining exposes them to the domain of the documents and makes them less susceptible to the failure modes of a pure language model; even so, over 90% of the hallucinations produced by BERTS2S are erroneous.

The paper concludes that while pretraining improves faithfulness and factuality, hallucination remains a significant challenge in abstractive summarization. Measures based on semantic inference, such as textual entailment, are better proxies for true summarization quality than ROUGE or BERTScore, and human evaluation remains essential because faithfulness and factuality are not well captured by automatic metrics.
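To see why ROUGE can disagree with faithfulness judgments, it helps to remember that it only measures n-gram overlap between the generated summary and a reference summary, never the source document. The short sketch below assumes the rouge-score package (a common implementation, not necessarily the one used in the paper) and uses made-up example sentences purely for illustration.

# What ROUGE actually measures: n-gram overlap with a reference summary.
# A fluent summary that hallucinates content absent from the source can still
# overlap heavily with the reference and therefore score well.
# Assumption: pip install rouge-score
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)

reference = "The city council approved the new transit budget on Tuesday."
candidate = "The city council approved the new transit budget after a heated vote."

scores = scorer.score(reference, candidate)
for name, s in scores.items():
    print(f"{name}: precision={s.precision:.2f} recall={s.recall:.2f} f1={s.fmeasure:.2f}")

Because the score is computed against the reference rather than against the source document, it cannot distinguish a faithful summary from a fluent but hallucinated one, which is why the paper argues for entailment-based measures and human evaluation.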