2024 | Chao Chen, Kai Liu, Ze Chen, Yi Gu, Yue Wu, Mingyuan Tao, Zhihang Fu, Jieping Ye
This paper proposes a novel approach for detecting hallucinations in large language models (LLMs) by leveraging their internal states. The key idea is that the dense semantic information retained in these internal states is more useful for hallucination detection than the logit-level or language-level uncertainty estimates used by traditional methods. The proposed method, called INSIDE, has two main components: (1) an EigenScore metric that measures the semantic consistency of sampled responses via the eigenvalues of the covariance matrix of their sentence embeddings, and (2) a test-time feature clipping approach that truncates extreme activations in the internal states, reducing overconfident generations and making overconfident hallucinations easier to detect.
The EigenScore metric captures semantic divergence in the dense embedding space, which is more informative than existing uncertainty or consistency metrics that operate in logit or language space. It is defined via the logarithm of the determinant of the covariance matrix of sentence embeddings, which reflects the differential entropy in the embedding space: when the sampled responses agree semantically, the covariance matrix is near-degenerate and the score is low, whereas divergent responses yield larger eigenvalues and a higher score. Feature clipping, in turn, reduces the impact of extreme activations in the internal states, which can lead to overconfident generations; it is implemented by truncating the activations of the penultimate layer of the LLM to a bounded range.
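To make the metric concrete, here is a minimal NumPy sketch of an EigenScore-style computation. It assumes K responses have already been sampled and embedded from an internal layer of the model; the function name, the choice to center across responses, and the regularizer `alpha` are illustrative, not necessarily the paper's exact formulation.

```python
import numpy as np

def eigenscore(embeddings: np.ndarray, alpha: float = 1e-3) -> float:
    """Consistency score for K sampled responses from their sentence
    embeddings: higher means more semantic divergence, i.e. a stronger
    hallucination signal.

    embeddings: (K, d) array, one internal-state embedding per response.
    alpha: small regularizer that keeps the covariance matrix full rank.
    """
    K, _ = embeddings.shape
    # Center the embeddings across the K sampled responses.
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # K x K covariance (Gram) matrix of the sampled responses.
    cov = centered @ centered.T
    # Regularized log-determinant via the eigenvalue spectrum; this is
    # the differential-entropy-style quantity described above.
    eigvals = np.linalg.eigvalsh(cov + alpha * np.eye(K))
    return float(np.log(eigvals).sum() / K)
```

Feature clipping can be sketched in the same hedged spirit. The percentile-based threshold selection below is an assumption for illustration; the summary only states that extreme penultimate-layer activations are truncated to a threshold.

```python
def clip_features(activations: np.ndarray, p: float = 0.99) -> np.ndarray:
    """Truncate extreme penultimate-layer activations to a bounded range.

    activations: (N, d) activations, used here both to estimate the
                 per-dimension thresholds and to be clipped (in practice
                 thresholds would be estimated once, then reused at
                 test time).
    p: percentile defining the clipping range (an assumed choice).
    """
    lo = np.quantile(activations, 1.0 - p, axis=0)
    hi = np.quantile(activations, p, axis=0)
    return np.clip(activations, lo, hi)
```

In a full pipeline, clipping would be applied to the penultimate-layer activations during generation, and the EigenScore would then be computed over the embeddings of the clipped, sampled responses.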
Extensive experiments on several popular LLMs and question-answering (QA) benchmarks demonstrate the effectiveness of the method. EigenScore outperforms existing approaches at hallucination detection, particularly on the CoQA and SQuAD datasets, and the feature clipping step yields a significant further improvement. Compared against state-of-the-art baselines, including Semantic Entropy, Shifting Attention to Relevance, and SelfCheckGPT, the proposed method consistently comes out ahead.
The paper also examines computational efficiency: because the method does not rely on a second large model to measure self-consistency, EigenScore is roughly 10 times more efficient than methods that do, and the overhead of feature clipping and the EigenScore computation itself is negligible. Evaluations on additional LLMs, including LLaMA2-7B and Falcon-7B, show consistently superior performance over competing methods. The paper concludes that leveraging the internal states of LLMs to capture semantic information is a promising approach to detecting hallucinations.