Unsupervised Real-Time Hallucination Detection based on the Internal States of Large Language Models

10 Jun 2024 | Weihang Su, Changyue Wang, Qingyao Ai, Yiran Hu, Zhijing Wu, Yujia Zhou, Yiqun Liu
This paper introduces MIND, an unsupervised training framework for real-time hallucination detection in large language models (LLMs), and HELM, a new benchmark for evaluating hallucination detection across multiple LLMs. Hallucination refers to the phenomenon where LLMs generate responses that are logically coherent but factually inaccurate. Existing detection methods are mostly post-processing techniques that are computationally intensive and limited in effectiveness. MIND instead leverages the internal states of LLMs to detect hallucinations in real time without requiring manual annotations, while HELM provides a comprehensive benchmark containing diverse LLM outputs together with the models' internal states during inference.

MIND is a lightweight framework that can be integrated into any existing Transformer-based LLM. It consists of two steps: automatic training data generation and hallucination classifier training. The training data is generated by truncating Wikipedia articles and letting the LLM produce continuation text. The hallucination classifier, a multilayer perceptron, is then trained on the LLM's contextualized embeddings, using the last token's embedding from the final layer for detection.

The framework is evaluated on the HELM benchmark, which covers six different LLMs and their internal states during text generation. Experiments show that MIND outperforms existing state-of-the-art methods and is both effective and efficient for real-time hallucination detection. The paper also discusses the limitations of the approach and directions for future work.
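To make the detection step concrete, the sketch below shows how a multilayer-perceptron classifier could score the last token's final-layer hidden state of a Transformer-based LLM, as the summary describes. This is a minimal illustration, not the paper's released code: the class name HallucinationMLP, the layer sizes, and the use of "gpt2" as a stand-in model are all assumptions for demonstration purposes.

```python
# Hedged sketch of MIND-style real-time detection: score the last token's
# final-layer contextualized embedding with a small MLP classifier.
# Class names, layer sizes, and the base model are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer


class HallucinationMLP(nn.Module):
    """Multilayer perceptron that maps one contextualized embedding to a score."""

    def __init__(self, hidden_size: int, mlp_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, mlp_dim),
            nn.ReLU(),
            nn.Linear(mlp_dim, 1),
        )

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        # Output is interpreted as the probability that the text is hallucinated.
        return torch.sigmoid(self.net(embedding))


# Any Transformer-based causal LM works; "gpt2" is only a small stand-in here.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
llm = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
classifier = HallucinationMLP(hidden_size=llm.config.hidden_size)


def hallucination_score(text: str) -> float:
    """Score generated text using the last token's final-layer hidden state."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = llm(**inputs)
    # hidden_states[-1] is the final layer; [:, -1, :] selects the last token.
    last_token_embedding = outputs.hidden_states[-1][:, -1, :]
    return classifier(last_token_embedding).item()


print(hallucination_score("The Eiffel Tower was completed in 1889."))
```

In an actual MIND-style setup the classifier weights would first be trained on the automatically generated Wikipedia-continuation data rather than used untrained as above; the snippet only illustrates why detection can run in real time, since it reuses hidden states the LLM already computes during generation.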