Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs

22 Jun 2024 | Jannik Kossen, Jiatong Han, Muhammed Razzak, Lisa Schut, Shreshth Malik, Yarin Gal
Semantic Entropy Probes (SEPs) are a cheap and reliable method for detecting hallucinations in Large Language Models (LLMs). Hallucinations, i.e., factually incorrect or arbitrary model outputs, remain a major obstacle to the practical adoption of LLMs. SEPs approximate semantic entropy directly from the hidden states of a single model generation, avoiding the high computational cost of previous methods that require sampling multiple generations at test time. Because SEPs are trained to predict semantic entropy rather than model accuracy, they do not need ground-truth correctness labels and generalize better to out-of-distribution data than accuracy probes. Experiments and ablation studies show that LLM hidden states capture semantic entropy across models, tasks, layers, and token positions, and that SEPs outperform accuracy probes at hallucination detection. This makes SEPs a promising approach for cost-effective uncertainty quantification in LLMs.
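To make the pipeline concrete, below is a minimal Python sketch (not the authors' released code): semantic entropy is approximated by grouping several sampled answers into meaning-equivalent clusters and taking the entropy over cluster frequencies, and the probe is a simple logistic regression fit on hidden states against binarized entropy labels. The clustering stand-in (normalized string matching, whereas the paper clusters with bidirectional NLI entailment), the median threshold for binarization, and all variable names are illustrative assumptions.

# Minimal sketch of a semantic entropy probe (illustrative, not the authors' implementation).
# Assumes you already have, for a set of training prompts:
#   hidden_states: one hidden-state vector per prompt (e.g. a token activation from a
#                  single generation at some layer), shape (n_prompts, d_model)
#   sampled_answers_per_prompt: several sampled generations per prompt, used only at
#                  training time to compute semantic entropy labels.
import numpy as np
from collections import Counter
from sklearn.linear_model import LogisticRegression

def semantic_entropy(answers):
    """Entropy over clusters of semantically equivalent answers (discrete approximation)."""
    # Crude stand-in for semantic clustering: normalized string match.
    clusters = Counter(a.strip().lower() for a in answers)
    p = np.array(list(clusters.values()), dtype=float)
    p /= p.sum()
    return float(-(p * np.log(p)).sum())

def train_sep(hidden_states, sampled_answers_per_prompt):
    """Fit a logistic-regression probe that predicts high vs. low semantic entropy."""
    entropies = np.array([semantic_entropy(a) for a in sampled_answers_per_prompt])
    labels = (entropies > np.median(entropies)).astype(int)  # median split: a simple illustrative choice
    probe = LogisticRegression(max_iter=1000)
    probe.fit(hidden_states, labels)
    return probe

# At test time only a single generation (and its hidden state) is needed:
# p_high_entropy = probe.predict_proba(new_hidden_state.reshape(1, -1))[0, 1]

The key point the sketch illustrates is that the expensive part (sampling many generations and clustering them) is only needed to create training labels; once trained, the probe scores a single generation's hidden state.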