Analyzing LLM Behavior in Dialogue Summarization: Unveiling Circumstantial Hallucination Trends

5 Jun 2024 | Sanjana Ramprasad, Elisa Ferracane, Zachary C. Lipton
This paper investigates the behavior of large language models (LLMs) in dialogue summarization, focusing on identifying and categorizing span-level inconsistencies. The authors evaluate the faithfulness of two prominent LLMs, GPT-4 and Alpaca-13B, against human annotations and compare them with existing fine-tuned models. They find that LLMs often produce plausible inferences from circumstantial evidence in the dialogue that nevertheless lack direct support, a pattern far less common in older fine-tuned models. To capture this behavior, the paper introduces a new error type, "Circumstantial Inference," as part of a refined error taxonomy.

Experiments cover two dialogue summarization datasets, SAMSum and DialogSum. Although LLM summaries contain fewer inconsistencies overall than those of fine-tuned models, over 30% of LLM-generated summaries still contain inconsistencies. Existing automatic error detection methods struggle to catch these nuanced errors in LLM outputs, underscoring the need for better detection.

To address this gap, the authors propose two prompt-based approaches for fine-grained error detection, ChatGPT-Span and ChatGPT-SpanMoE, which outperform existing metrics, particularly at identifying circumstantial inferences. The paper concludes that while LLMs improve the faithfulness of dialogue summaries, detecting their remaining errors is still difficult, and it calls for more effective error detection methods and better evaluation metrics for LLM-generated summaries, along with further study of LLM behavior in dialogue summarization.
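To make the prompt-based detection idea concrete, here is a minimal sketch of how a span-level faithfulness check in the spirit of ChatGPT-Span might be structured. The prompt wording, output format, and function names below are illustrative assumptions, not the paper's exact protocol; any chat-model client can be plugged in as the `llm` callable.

```python
# Hypothetical sketch of a prompt-based span-level inconsistency check,
# loosely modeled on ChatGPT-Span. Prompt text and output format are
# assumptions for illustration, not the paper's exact setup.
from typing import Callable, List, Tuple

PROMPT_TEMPLATE = """You are checking a dialogue summary for faithfulness.
Dialogue:
{dialogue}

Summary:
{summary}

List every summary span that is not directly supported by the dialogue,
one per line, in the form: <span> ||| <error type>.
Use the error type "circumstantial inference" for spans that are plausible
guesses from circumstantial evidence but are never stated. If every span is
supported, reply with "NONE"."""


def detect_unsupported_spans(
    dialogue: str,
    summary: str,
    llm: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Query an LLM (any callable: prompt -> text) and parse flagged spans."""
    reply = llm(PROMPT_TEMPLATE.format(dialogue=dialogue, summary=summary))
    if reply.strip().upper() == "NONE":
        return []
    spans = []
    for line in reply.splitlines():
        if "|||" in line:
            span, error_type = (part.strip() for part in line.split("|||", 1))
            spans.append((span, error_type))
    return spans


if __name__ == "__main__":
    # Toy stand-in for a chat model so the sketch runs offline; a real run
    # would call an actual LLM here.
    def fake_llm(prompt: str) -> str:
        return "is stranded without a car ||| circumstantial inference"

    dialogue = "A: Can you pick me up from work today?\nB: Sure, 6pm?"
    summary = "A asks B for a ride because A is stranded without a car."
    print(detect_unsupported_spans(dialogue, summary, fake_llm))
```

The toy example also illustrates the error type itself: the summary's claim that A "is stranded without a car" is a plausible inference from the request for a ride, but nothing in the dialogue states it, which is exactly the kind of unsupported span the paper labels a circumstantial inference.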