July 18, 2024 | Yasin Abbasi Yadkori, Ilja Kuzborskij, András György, Csaba Szepesvári
The paper explores uncertainty quantification in large language models (LLMs) to identify when responses are unreliable due to epistemic uncertainty, which arises from a lack of knowledge about the ground truth. The authors introduce an information-theoretic metric that detects when only epistemic uncertainty is large and that can be computed by iteratively prompting the LLM. The metric is insensitive to aleatoric uncertainty, allowing hallucinations to be detected in both single- and multi-answer responses. The paper also presents a finite-sample estimator for mutual information (MI) and an algorithm for hallucination detection based on thresholding this estimator. Experiments demonstrate the effectiveness of the proposed method, showing superior performance compared to baseline methods in detecting hallucinations on various benchmarks. Additionally, the paper provides insights into how iterative prompting can change model outputs, focusing in particular on a single self-attention head.
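To make the thresholding idea concrete, here is a minimal sketch (not the paper's exact finite-sample estimator) of an MI-style score computed via iterative prompting: sample first-round answers, re-prompt the model with each answer appended, and estimate the mutual information between the first answer and the follow-up answer. The names `sample_answer`, `toy_llm`, `flag_hallucination`, and the threshold value are hypothetical placeholders, not from the paper.

```python
import math
import random
from collections import Counter


def answer_distribution(answers):
    """Empirical distribution over distinct answer strings."""
    counts = Counter(answers)
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}


def mi_estimate(sample_answer, query, k=20, m=20):
    """Plug-in estimate of the mutual information between a first-round
    answer A1 and a follow-up answer A2 obtained by re-prompting the model
    with A1 appended to the query ("iterative prompting").

    `sample_answer(prompt) -> str` is a hypothetical stand-in for one
    stochastic LLM call; k and m are the numbers of first-round and
    follow-up samples. A large value means that showing the model its own
    previous answer strongly shifts what it says next, which signals
    epistemic rather than aleatoric uncertainty.
    """
    # First-round answers and their empirical distribution.
    first = [sample_answer(query) for _ in range(k)]
    p_first = answer_distribution(first)

    # Conditional follow-up distributions, one per distinct first answer.
    conditionals = {}
    for a in p_first:
        follow_up = f"{query}\nOne possible answer is: {a}\n{query}"
        conditionals[a] = answer_distribution(
            [sample_answer(follow_up) for _ in range(m)]
        )

    # Marginal follow-up distribution: mixture of the conditionals.
    marginal = Counter()
    for a, pa in p_first.items():
        for b, pb in conditionals[a].items():
            marginal[b] += pa * pb

    # Plug-in mutual information I(A1; A2).
    mi = 0.0
    for a, pa in p_first.items():
        for b, pb in conditionals[a].items():
            mi += pa * pb * math.log(pb / marginal[b])
    return mi


def flag_hallucination(score, threshold=0.1):
    """Threshold the MI-style score to flag an unreliable response."""
    return score > threshold


if __name__ == "__main__":
    random.seed(0)

    # Toy stand-in for an LLM (illustrative only): it copies whatever
    # previous answer it is shown, as a guessing model might.
    def toy_llm(prompt):
        if "possible answer is: " in prompt:
            return prompt.split("possible answer is: ")[1].splitlines()[0]
        return random.choice(["Paris", "Lyon", "Nice"])

    score = mi_estimate(toy_llm, "What is the capital of France?")
    print(score, flag_hallucination(score))
```

In this toy setup the model guesses at random but then echoes whichever answer it is shown, so the estimated MI is large and the response is flagged; a model whose follow-up answers do not depend on the appended answer would yield a score near zero even when its answer distribution is spread out, which is the sense in which the score ignores aleatoric uncertainty.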