16 Jan 2024 | Junliang Luo, Tianyu Li, Di Wu, Michael Jenkin, Steve Liu, Gregory Dudek
This paper investigates hallucination detection and mitigation in large language models (LLMs). LLMs such as ChatGPT, Bard, and Llama have shown great success across a wide range of applications, but they are prone to hallucinations: responses that sound plausible yet are factually incorrect. Detecting and mitigating hallucinations is therefore crucial for applying LLMs to real-world tasks. The paper reviews current detection and mitigation approaches, along with the classification and natural language generation (NLG) metrics used to evaluate them.
Common classification metrics include accuracy, precision, recall, F-score, AUC, BSS, and G-mean. NLG metrics include n-gram overlap measures such as BLEU, ROUGE, and METEOR, model-based metrics such as BERTScore, BARTScore, and BLEURT, and question-answering-based metrics such as FEQA and QUESTEVAL. These metrics evaluate how closely generated text matches a reference or source text; BLEU is widely used for machine translation and ROUGE for summarization.
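To make these metrics concrete, the sketch below computes a few of them on a toy hallucination-detection setup. It assumes scikit-learn, NLTK, and the rouge-score package are installed; the labels and texts are illustrative only, not data from the paper.

```python
# Illustrative sketch: computing common hallucination-detection metrics.
# Assumes scikit-learn, NLTK, and the rouge-score package are installed;
# the labels and texts below are toy examples.
from math import sqrt

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support, roc_auc_score)
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from rouge_score import rouge_scorer

# --- Classification metrics (1 = hallucinated, 0 = faithful) ---
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]  # detector confidence

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary")
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
g_mean = sqrt((tp / (tp + fn)) * (tn / (tn + fp)))  # sqrt(sensitivity * specificity)

print("accuracy ", accuracy_score(y_true, y_pred))
print("precision", precision, "recall", recall, "F1", f1)
print("AUC      ", roc_auc_score(y_true, y_score))
print("G-mean   ", g_mean)

# --- NLG overlap metrics between a generated summary and a reference ---
reference = "the company reported higher profits in the third quarter"
generated = "the company reported record profits in the third quarter"

bleu = sentence_bleu([reference.split()], generated.split(),
                     smoothing_function=SmoothingFunction().method1)
rouge = rouge_scorer.RougeScorer(["rouge1", "rougeL"]).score(reference, generated)

print("BLEU    ", round(bleu, 3))
print("ROUGE-1 ", round(rouge["rouge1"].fmeasure, 3))
print("ROUGE-L ", round(rouge["rougeL"].fmeasure, 3))
```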
The paper discusses two main granularities of hallucination detection: token-level and sentence-level. Token-level detection identifies specific tokens or named entities that may be hallucinated, while sentence-level detection flags entire hallucinated sentences. Several studies are reviewed, including HADES, which proposes a token-level, reference-free hallucination detection benchmark, and Neural Path Hunter (NPH), which reduces hallucination in knowledge-grounded dialogue by grounding responses on knowledge-graph paths. Another reviewed line of work observes that some hallucinations in summarization are in fact factual and proposes a method to distinguish factual from non-factual hallucinations.
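As a rough illustration of the token-level idea (this is not the HADES model), the sketch below flags candidate entities in a generated text that cannot be grounded in the source. Numbers and capitalized tokens serve as a crude stand-in for named entities; a real system would use a trained tagger or NER model.

```python
# Minimal illustration of token-level hallucination flagging (not HADES itself):
# treat numbers and capitalized tokens as candidate entities and flag those in
# the generated text that cannot be found in the source.
import re

def candidate_entities(text: str) -> set[str]:
    """Crude proxy for named entities: numbers and capitalized words."""
    tokens = re.findall(r"[A-Za-z0-9']+", text)
    return {t.lower() for t in tokens if t[0].isupper() or t[0].isdigit()}

def flag_ungrounded_tokens(source: str, generated: str) -> set[str]:
    """Return candidate-entity tokens of `generated` missing from `source`."""
    source_tokens = {t.lower() for t in re.findall(r"[A-Za-z0-9']+", source)}
    return {t for t in candidate_entities(generated) if t not in source_tokens}

source = "Acme Corp reported profits of 3.2 million dollars in 2023."
generated = "Acme Corp reported profits of 5 million dollars in 2023, said CEO Jane Doe."

print(flag_ungrounded_tokens(source, generated))
# e.g. {'5', 'ceo', 'jane', 'doe'} -- tokens the source cannot support
```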
Sentence-level detection methods include SelfCheckGPT, which detects hallucinations through self-consistency: it samples multiple stochastic responses from the same LLM and checks whether they agree with the original answer. ALIGNSCORE evaluates factual consistency with a unified alignment function trained across a broad range of tasks. ExHalder detects hallucinations in news headlines by combining a reasoning classifier, a hinted classifier, and explanation generation. HaRiM+ is a reference-free metric that evaluates summary quality by estimating hallucination risk.
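The sketch below illustrates the SelfCheckGPT-style self-consistency idea: sentences of the main answer that are poorly supported by independently sampled answers are treated as likely hallucinations. The actual method uses stronger scorers (BERTScore, QA, or NLI variants); here a simple unigram-overlap score stands in for them, and `sample_response` is a hypothetical placeholder for an LLM call with temperature above zero.

```python
# Simplified sketch of SelfCheckGPT-style self-consistency checking.
# `sample_response` is a hypothetical stand-in for calling an LLM with
# temperature > 0; the support score is a crude unigram-overlap proxy for the
# BERTScore/QA/NLI scorers used in the real method.
import re

def sample_response(prompt: str, temperature: float = 1.0) -> str:
    """Hypothetical LLM call; replace with your own API client."""
    raise NotImplementedError

def unigram_support(sentence: str, sample: str) -> float:
    """Fraction of the sentence's word types that also appear in the sample."""
    words = set(re.findall(r"[a-z0-9']+", sentence.lower()))
    sample_words = set(re.findall(r"[a-z0-9']+", sample.lower()))
    return len(words & sample_words) / max(len(words), 1)

def self_consistency_scores(answer: str, samples: list[str]) -> dict[str, float]:
    """Average support of each sentence of `answer` across the sampled answers."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    return {s: sum(unigram_support(s, smp) for smp in samples) / len(samples)
            for s in sentences}

# Usage: sentences with low average support are candidates for hallucination.
# samples = [sample_response(prompt) for _ in range(5)]
# scores = self_consistency_scores(main_answer, samples)
# flagged = [s for s, score in scores.items() if score < 0.5]
```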
The paper highlights the importance of context in hallucination detection and the need for models that can capture nuanced semantic distinctions between the source text and the generated summary. It also emphasizes the challenges of hallucination detection in real-world applications and the need for robust, efficient methods to address them.