Analyzing LLM Behavior in Dialogue Summarization: Unveiling Circumstantial Hallucination Trends

5 Jun 2024 | Sanjana Ramprasad, Elisa Ferracane, Zachary C. Lipton
This paper investigates the behavior of large language models (LLMs) in dialogue summarization, focusing on identifying and categorizing span-level inconsistencies. The authors evaluate the faithfulness of two prominent LLMs, GPT-4 and Alpaca-13B, against human annotations and compare them with existing fine-tuned models. They find that LLMs often produce plausible inferences from circumstantial evidence in the dialogue that nevertheless lack direct support, a pattern far less common in older fine-tuned models. To capture this behavior, the paper introduces a new error type, "Circumstantial Inference," as part of a refined error taxonomy.

Experiments cover two dialogue summarization datasets, SAMSum and DialogSum. Although LLM summaries contain fewer inconsistencies overall than those of fine-tuned models, over 30% of LLM-generated summaries still contain inconsistencies. Existing automatic error detection methods struggle to catch these nuanced errors in LLM outputs, underscoring the need for better detection.

To address this gap, the authors propose two prompt-based approaches for fine-grained error detection, ChatGPT-Span and ChatGPT-SpanMoE, which outperform existing metrics, particularly at identifying circumstantial inferences. The paper concludes that while LLMs improve the faithfulness of dialogue summaries, detecting their remaining errors is still difficult, and it calls for more effective error detection methods and better evaluation metrics for LLM-generated summaries, along with further study of LLM behavior in dialogue summarization.
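To make the prompt-based detection idea concrete, here is a minimal sketch of how a span-level faithfulness check in the spirit of ChatGPT-Span might be structured. The prompt wording, output format, and function names below are illustrative assumptions, not the paper's exact protocol; any chat-model client can be plugged in as the `llm` callable.

```python
# Hypothetical sketch of a prompt-based span-level inconsistency check,
# loosely modeled on ChatGPT-Span. Prompt text and output format are
# assumptions for illustration, not the paper's exact setup.
from typing import Callable, List, Tuple

PROMPT_TEMPLATE = """You are checking a dialogue summary for faithfulness.
Dialogue:
{dialogue}

Summary:
{summary}

List every summary span that is not directly supported by the dialogue,
one per line, in the form: <span> ||| <error type>.
Use the error type "circumstantial inference" for spans that are plausible
guesses from circumstantial evidence but are never stated. If every span is
supported, reply with "NONE"."""


def detect_unsupported_spans(
    dialogue: str,
    summary: str,
    llm: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Query an LLM (any callable: prompt -> text) and parse flagged spans."""
    reply = llm(PROMPT_TEMPLATE.format(dialogue=dialogue, summary=summary))
    if reply.strip().upper() == "NONE":
        return []
    spans = []
    for line in reply.splitlines():
        if "|||" in line:
            span, error_type = (part.strip() for part in line.split("|||", 1))
            spans.append((span, error_type))
    return spans


if __name__ == "__main__":
    # Toy stand-in for a chat model so the sketch runs offline; a real run
    # would call an actual LLM here.
    def fake_llm(prompt: str) -> str:
        return "is stranded without a car ||| circumstantial inference"

    dialogue = "A: Can you pick me up from work today?\nB: Sure, 6pm?"
    summary = "A asks B for a ride because A is stranded without a car."
    print(detect_unsupported_spans(dialogue, summary, fake_llm))
```

The toy example also illustrates the error type itself: the summary's claim that A "is stranded without a car" is a plausible inference from the request for a ride, but nothing in the dialogue states it, which is exactly the kind of unsupported span the paper labels a circumstantial inference.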