3 Jul 2024 | Ziwei Ji, Delong Chen, Etsuko Ishii, Samuel Cahyawijaya, Yejin Bang, Bryan Willie, Pascale Fung
This paper investigates whether Large Language Models (LLMs) can estimate their own hallucination risk before generating responses. The study analyzes the internal mechanisms of LLMs in terms of training data sources and across 15 diverse Natural Language Generation (NLG) tasks spanning over 700 datasets. The empirical analysis reveals two key insights: (1) LLM internal states indicate whether the model has seen the query in its training data; and (2) LLM internal states indicate whether the model is likely to hallucinate in response to the query. The study identifies particular neurons, activation layers, and tokens that play a crucial role in the LLM's perception of uncertainty and hallucination risk. Using a probing estimator over these internal states, the researchers leverage LLM self-assessment to achieve an average hallucination estimation accuracy of 84.32% at run time.
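To make the probing idea concrete, here is a minimal sketch of an internal-state probe: extract the hidden state of the final query token before any generation and fit a lightweight classifier on it. The model name, the choice of token and layer, the toy training pairs, and the logistic-regression probe are all illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of an internal-state probing estimator (assumptions: an open LLM
# such as Llama-2-7b, the last query token's hidden state as the feature, and a
# logistic-regression probe; the paper's exact probe may differ).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, output_hidden_states=True, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

@torch.no_grad()
def query_feature(query: str, layer: int = -1) -> torch.Tensor:
    """Hidden state of the final query token at a chosen layer, before any generation."""
    inputs = tokenizer(query, return_tensors="pt").to(model.device)
    hidden_states = model(**inputs).hidden_states  # tuple of (num_layers + 1) tensors
    return hidden_states[layer][0, -1].float().cpu()

# hypothetical labels: 1 if the model hallucinated on the query, 0 otherwise
train_pairs = [("Who wrote 'Dune'?", 0), ("List the moons of Kepler-442b.", 1)]
X = torch.stack([query_feature(q) for q, _ in train_pairs]).numpy()
y = [label for _, label in train_pairs]

probe = LogisticRegression(max_iter=1000).fit(X, y)  # the probing estimator
risk = probe.predict_proba(query_feature("Who discovered penicillin?").numpy()[None])[0, 1]
print(f"Estimated hallucination risk: {risk:.2f}")
```

Because the feature is computed from the query alone, the risk estimate is available before any tokens are generated, which is the point of the pre-generation self-assessment studied in the paper.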
The study highlights the lack of cognitive uncertainty estimation in LLM-based AI assistants, which leads to overconfidence and potential hallucination. The researchers investigate whether LLM internal states can reliably estimate hallucination risk before response generation. They construct datasets that distinguish queries seen in training data from unseen ones, and label the hallucination risk associated with each query. The results show that LLM internal states can effectively estimate hallucination risk, with an accuracy of 80.28% in identifying seen/unseen queries and 84.32% in estimating hallucination risk across the 15 NLG tasks.
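As a toy illustration of how per-query risk labels might be constructed for training such a probe, the sketch below marks a model answer as hallucinated when its token overlap with a reference falls below a threshold; the heuristic, threshold, and examples are illustrative assumptions, not the paper's labeling protocol.

```python
# A sketch of one possible way to label queries for probe training (illustrative
# only, not the paper's protocol): mark an answer as hallucinated when its token
# overlap with the reference falls below a threshold.
import re

def hallucination_label(answer: str, reference: str, threshold: float = 0.5) -> int:
    """Return 1 if the answer looks hallucinated (low overlap with the reference), else 0."""
    ans_tokens = set(re.findall(r"\w+", answer.lower()))
    ref_tokens = set(re.findall(r"\w+", reference.lower()))
    overlap = len(ans_tokens & ref_tokens) / max(len(ref_tokens), 1)
    return int(overlap < threshold)

print(hallucination_label("Sydney is the capital of Australia.", "Canberra"))    # -> 1
print(hallucination_label("The capital of Australia is Canberra.", "Canberra"))  # -> 0
```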
The study also examines the role of specific neurons and activation layers in the LLM's perception of uncertainty and hallucination risk. The results show a positive correlation between the depth of the activation layer and predictive accuracy. The consistency of internal states across different models suggests a potential for zero-shot transfer, but model-specific estimation remains the optimal strategy. The study also evaluates the efficiency of the internal state-based estimator, which requires minimal compute and offers fast inference.
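To illustrate how the depth-accuracy trend could be measured, the sketch below fits one probe per transformer layer and reports cross-validated accuracy. It builds on the hypothetical `query_feature` helper and labeled pairs from the earlier sketch; the cross-validation setup is an assumption, not the paper's exact evaluation protocol.

```python
# Sketch: probe each layer separately to see how accuracy varies with depth
# (builds on `query_feature` and the labeled pairs from the earlier sketch;
# the cross-validation setup here is illustrative).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def layer_accuracies(pairs, num_layers: int) -> list[float]:
    """Cross-validated probe accuracy for each transformer layer."""
    queries, labels = zip(*pairs)
    accs = []
    for layer in range(1, num_layers + 1):  # index 0 is the embedding layer
        X = np.stack([query_feature(q, layer=layer).numpy() for q in queries])
        probe = LogisticRegression(max_iter=1000)
        accs.append(cross_val_score(probe, X, np.array(labels), cv=5).mean())
    return accs

# accs = layer_accuracies(labeled_pairs, model.config.num_hidden_layers)
# If the depth-accuracy correlation holds, later entries of `accs` should be higher.
```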
The findings suggest that understanding LLM internal states could offer a proactive approach to estimating uncertainty and potentially serve as an early indicator for retrieval augmentation or as an early warning system. The study also highlights the challenge of generalizing across different tasks and the need for further research to improve the robustness and generalization of hallucination risk assessment. It concludes that LLMs have the latent capacity to self-assess and estimate hallucination risk prior to response generation, with internal state-based self-assessment outperforming PPL-based and prompting-based baselines.
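For context on the PPL-based baseline, here is a minimal sketch of the kind of signal such a baseline might use: the model's perplexity on the query itself, with higher values treated as higher risk. It reuses the `model` and `tokenizer` from the first sketch, and the thresholding idea is an illustrative assumption rather than the paper's exact baseline.

```python
# Sketch of a perplexity (PPL) signal such a baseline might use (reuses `model` and
# `tokenizer` from the first sketch; the risk threshold is illustrative).
import torch

@torch.no_grad()
def query_perplexity(query: str) -> float:
    """Perplexity of the query under the LLM, computed from the mean token NLL."""
    inputs = tokenizer(query, return_tensors="pt").to(model.device)
    loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

# e.g., flag queries whose perplexity exceeds a tuned threshold as high risk
print(query_perplexity("List the moons of Kepler-442b."))
```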