29 Mar 2024 | Che Jiang, Biqing Qi, Xiangyu Hong, Dayuan Fu, Yang Cheng, Fandong Meng, Mo Yu, Bowen Zhou, Jie Zhou
This paper investigates the phenomenon of Large Language Models (LLMs) hallucinating facts they demonstrably know, focusing on the inference dynamics that lead to these errors. The analysis rests on two key ideas: first, the authors compare factual questions that query the same knowledge triplet but elicit different answers, so that correct and hallucinated responses to the same underlying fact can be contrasted directly; second, they map intermediate residual streams to vocabulary space to track how the probability of the output token evolves across layers. In hallucinated cases, the output token's probability rarely shows the abrupt increase and sustained dominance in the later layers that characterize correct recall. The findings suggest that these hallucinations arise from failed knowledge recall, with MLP modules contributing more to the incorrect outputs than attention modules. Using the layer-wise dynamic curve as a feature, the authors build a classifier that detects hallucinatory predictions with an 88% success rate, shedding light on why LLMs hallucinate and offering a way to predict when they do. The paper also discusses the limitations of the study and suggests directions for future research.
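The layer-wise probability curve described above can be approximated with a logit-lens-style projection: at each layer, the residual stream is passed through the model's final layer norm and unembedding matrix to read off the probability assigned to the eventual output token. The sketch below illustrates this idea with GPT-2 via Hugging Face Transformers; the model choice, prompt, and function names are illustrative assumptions, not the authors' exact setup, which studies larger LLMs on triplet-based factual questions.

```python
# Minimal sketch of extracting a per-layer "dynamic curve" for the output token.
# Assumptions: GPT-2 stands in for the LLMs studied in the paper, and the
# projection follows the standard logit-lens recipe (final layer norm + unembedding).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

def output_token_curve(prompt: str) -> torch.Tensor:
    """Probability of the model's final predicted token, read out at every layer."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # The token the model actually predicts at the last position.
    predicted_id = out.logits[0, -1].argmax().item()
    probs_per_layer = []
    for hidden in out.hidden_states:  # embedding output + one entry per block
        # Project the residual stream at the last position into vocabulary space.
        h = model.transformer.ln_f(hidden[0, -1])
        logits = model.lm_head(h)
        probs_per_layer.append(torch.softmax(logits, dim=-1)[predicted_id])
    return torch.stack(probs_per_layer)

curve = output_token_curve("The capital of France is")
print(curve)  # hallucinated answers tend to lack a sharp late-layer rise
```

A simple classifier (e.g., logistic regression or a small MLP) can then be trained on these per-layer curves, using correctness labels obtained by comparing the model's answers against ground truth; this mirrors the detection setup the authors report at 88% accuracy, though their exact features and classifier may differ.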