Characterizing Truthfulness in Large Language Model Generations with Local Intrinsic Dimension

28 Feb 2024 | Fan Yin, Jayanth Srinivasa, Kai-Wei Chang
This paper proposes a method for characterizing and predicting the truthfulness of texts generated by large language models (LLMs) using the local intrinsic dimension (LID) of model activations. LID measures the minimal number of dimensions needed to describe a point's local neighborhood in activation space without significant information loss. Applied to four question-answering datasets, the method proves effective at detecting hallucinations: truthful outputs, being closer to natural language, tend to have smaller LIDs, while untruthful outputs have larger ones. Compared with entropy-based uncertainty methods and linear probes, the LID-based approach is more accurate and generalizes better, improving AUROC by about 8%.

The study also examines how intrinsic dimensions relate to model layers, autoregressive language modeling, and instruction tuning, showing that intrinsic dimension is a useful lens for understanding LLMs. LIDs are estimated with maximum likelihood estimation, and the method is robust to variations in hyperparameters and datasets. The results suggest that intrinsic dimensions can be used to detect hallucinations and improve the reliability of LLMs. The code and data are available for further research.
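To make the estimation step concrete, below is a minimal sketch of maximum-likelihood LID estimation in the style of the standard Levina-Bickel estimator. It assumes activations are plain NumPy arrays; the function and variable names, the neighborhood size k, and the synthetic data are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of maximum-likelihood LID estimation (Levina-Bickel style).
# Assumption: `activations` stands in for one layer's hidden states from an LLM;
# names, k, and the synthetic data below are illustrative, not the paper's code.
import numpy as np

def lid_mle(query: np.ndarray, reference: np.ndarray, k: int = 20) -> float:
    """Estimate the local intrinsic dimension of `query` against `reference` points."""
    # Euclidean distances from the query activation to every reference activation.
    dists = np.sort(np.linalg.norm(reference - query, axis=1))
    # Drop any zero self-distance and keep the k nearest neighbors.
    dists = dists[dists > 0][:k]
    # MLE: inverse of the mean log-ratio between the k-th and closer neighbor distances.
    log_ratios = np.log(dists[-1] / dists[:-1])
    return float(len(log_ratios) / np.sum(log_ratios))

# Toy usage: random vectors standing in for activation states of generated answers.
rng = np.random.default_rng(0)
activations = rng.normal(size=(1000, 4096))
print(lid_mle(activations[0], activations[1:], k=20))
```

In a hallucination-detection setting, per-generation LID scores like this would be ranked or thresholded (e.g., for computing AUROC), with smaller values taken to indicate more truthful outputs.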