29 Mar 2024 | Che Jiang, Biqing Qi, Xiangyu Hong, Dayuan Fu, Yang Cheng, Fandong Meng, Mo Yu, Bowen Zhou, Jie Zhou
This paper investigates why large language models (LLMs) produce hallucinations even when the correct factual knowledge is stored in their parameters. The study focuses on known fact hallucinations, cases where a model gives both correct and incorrect answers for the same knowledge triplet depending on how it is queried. The researchers analyze the inference dynamics of LLMs to understand why such hallucinations occur, particularly when models fail to recall parametric knowledge, and propose a method that detects hallucinations from the dynamic changes in output token probabilities across the model's layers.
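To make the layer-wise probability analysis concrete, here is a minimal logit-lens-style sketch that projects each layer's hidden state at the final position through the unembedding matrix and traces how a candidate answer token's probability evolves across layers. GPT-2 and the example prompt are stand-ins chosen to keep the demo self-contained; the paper's models and exact procedure may differ.

```python
# Minimal logit-lens-style sketch: project each layer's hidden state at the
# final position through the final LayerNorm and unembedding matrix, and trace
# how the probability of a candidate answer token evolves across layers.
# GPT-2 is a stand-in model; the paper's setup may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; any decoder-only causal LM works similarly
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

prompt = "The capital of France is"
answer_token = tok.encode(" Paris")[0]  # first token of the candidate answer

with torch.no_grad():
    out = model(**tok(prompt, return_tensors="pt"))

final_norm = model.transformer.ln_f  # final LayerNorm applied before unembedding
unembed = model.lm_head              # (tied) unembedding matrix

# hidden_states has num_layers + 1 entries; index 0 is the embedding output.
for layer, h in enumerate(out.hidden_states):
    logits = unembed(final_norm(h[:, -1, :]))  # vocab logits at the last position
    prob = torch.softmax(logits, dim=-1)[0, answer_token].item()
    print(f"layer {layer:2d}: p(answer) = {prob:.4f}")
```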
The study builds a dataset of factual knowledge queries in which the same knowledge triplet is queried with different phrasings, so that the model answers some prompts correctly and hallucinates on others. By comparing the inference dynamics of correct and hallucinated outputs, the researchers identify characteristic patterns in how the model processes information: in hallucinated cases, the output token's probability does not show the steep increase in later layers that correct cases exhibit. This difference is used to build a classifier that detects hallucinations with an 88% success rate.
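The detection step can be sketched as an ordinary supervised classification problem over these trajectories. In the illustration below, the per-layer probability trajectory of the emitted token is the feature vector and a small SVM separates correct from hallucinated cases; the synthetic trajectories and the SVM choice are assumptions for illustration, not the paper's exact features or classifier.

```python
# Illustrative sketch of the detection step: treat each example's per-layer
# probability trajectory of the emitted token (e.g., from the logit-lens loop
# above) as a feature vector and fit a binary classifier that separates
# correct from hallucinated generations. The synthetic data and SVM are
# stand-ins, not the paper's exact setup.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_layers = 32

def fake_trajectory(correct: bool) -> np.ndarray:
    """Synthetic stand-in: correct cases ramp up sharply in late layers."""
    base = rng.uniform(0.0, 0.1, n_layers)
    if correct:
        base[-8:] += np.linspace(0.1, 0.8, 8)  # steep late-layer increase
    return np.clip(base, 0.0, 1.0)

X = np.stack([fake_trajectory(c) for c in [True] * 200 + [False] * 200])
y = np.array([1] * 200 + [0] * 200)  # 1 = correct, 0 = hallucinated

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = SVC(kernel="rbf").fit(X_tr, y_tr)
print("held-out accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```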
The study also examines the role of different model components, such as the attention and MLP modules, in generating hallucinations, and finds that MLP modules have a more significant impact on incorrect outputs than attention modules. The researchers further observe that these dynamic patterns of output token probabilities can themselves distinguish correct from hallucinated outputs.
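One rough way to probe module-level contributions is to hook each layer's attention and MLP outputs and project them through the unembedding matrix, measuring how much each residual-stream update pushes the emitted token's logit. The sketch below does this for GPT-2 as a stand-in; the module names (`attn`, `mlp`) and the direct-logit-attribution measure are assumptions, and the final LayerNorm is ignored for simplicity.

```python
# Rough sketch of module-level attribution: capture the attention and MLP
# outputs added to the residual stream at each layer, then project each
# through the unembedding matrix to see how much it boosts the emitted
# token's logit. GPT-2 module names are stand-ins; the final LayerNorm is
# ignored, so this is only an approximate contribution measure.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

captured = {}  # (layer, module_name) -> output at the last position

def make_hook(key):
    def hook(module, inputs, output):
        out = output[0] if isinstance(output, tuple) else output
        captured[key] = out[:, -1, :].detach()
    return hook

for i, block in enumerate(model.transformer.h):
    block.attn.register_forward_hook(make_hook((i, "attn")))
    block.mlp.register_forward_hook(make_hook((i, "mlp")))

enc = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits
token = logits[0, -1].argmax().item()  # the token the model actually emits

W_U = model.lm_head.weight  # [vocab_size, d_model]
for (layer, name), vec in sorted(captured.items()):
    contrib = (vec @ W_U[token]).item()  # direct logit contribution of this update
    print(f"layer {layer:2d} {name}: logit contribution {contrib:+.3f}")
```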
The study contributes to the understanding of LLMs' hallucination mechanisms and provides a method for detecting hallucinations in model outputs. It highlights the value of analyzing the internal inference dynamics of LLMs for improving their reliability in practical applications. The findings suggest that known fact hallucinations often stem from a failure to recall parametric knowledge, and that the dynamic changes in output token probabilities across layers can identify such failures. The authors also call for further research into the mechanisms behind LLMs' hallucinations and into methods to mitigate them.