Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs

22 Jun 2024 | Jannik Kossen, Jiatong Han, Muhammed Razzak, Lisa Schut, Shreshth Malik, Yarin Gal
Semantic Entropy Probes (SEPs) are a cheap and reliable method for detecting hallucinations in Large Language Models (LLMs). Hallucinations, i.e., factually incorrect or arbitrary model outputs, remain a major obstacle to the practical adoption of LLMs. SEPs approximate semantic entropy directly from the hidden states of a single model generation, avoiding the high computational cost of previous methods that require sampling multiple generations at test time. Because SEPs are trained to predict semantic entropy rather than model accuracy, they do not need ground-truth correctness labels and generalize better to out-of-distribution data than accuracy probes. Experiments and ablation studies show that LLM hidden states capture semantic entropy across models, tasks, layers, and token positions, and that SEPs outperform accuracy probes at hallucination detection. This makes SEPs a promising approach for cost-effective uncertainty quantification in LLMs.
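To make the pipeline concrete, below is a minimal Python sketch (not the authors' released code): semantic entropy is approximated by grouping several sampled answers into meaning-equivalent clusters and taking the entropy over cluster frequencies, and the probe is a simple logistic regression fit on hidden states against binarized entropy labels. The clustering stand-in (normalized string matching, whereas the paper clusters with bidirectional NLI entailment), the median threshold for binarization, and all variable names are illustrative assumptions.

# Minimal sketch of a semantic entropy probe (illustrative, not the authors' implementation).
# Assumes you already have, for a set of training prompts:
#   hidden_states: one hidden-state vector per prompt (e.g. a token activation from a
#                  single generation at some layer), shape (n_prompts, d_model)
#   sampled_answers_per_prompt: several sampled generations per prompt, used only at
#                  training time to compute semantic entropy labels.
import numpy as np
from collections import Counter
from sklearn.linear_model import LogisticRegression

def semantic_entropy(answers):
    """Entropy over clusters of semantically equivalent answers (discrete approximation)."""
    # Crude stand-in for semantic clustering: normalized string match.
    clusters = Counter(a.strip().lower() for a in answers)
    p = np.array(list(clusters.values()), dtype=float)
    p /= p.sum()
    return float(-(p * np.log(p)).sum())

def train_sep(hidden_states, sampled_answers_per_prompt):
    """Fit a logistic-regression probe that predicts high vs. low semantic entropy."""
    entropies = np.array([semantic_entropy(a) for a in sampled_answers_per_prompt])
    labels = (entropies > np.median(entropies)).astype(int)  # median split: a simple illustrative choice
    probe = LogisticRegression(max_iter=1000)
    probe.fit(hidden_states, labels)
    return probe

# At test time only a single generation (and its hidden state) is needed:
# p_high_entropy = probe.predict_proba(new_hidden_state.reshape(1, -1))[0, 1]

The key point the sketch illustrates is that the expensive part (sampling many generations and clustering them) is only needed to create training labels; once trained, the probe scores a single generation's hidden state.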