TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space

5 Jun 2024 | Shaolei Zhang, Tian Yu, Yang Feng
**Authors:** Shaolei Zhang, Tian Yu, Yang Feng

**Institution:** Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences; Key Laboratory of AI Safety, Chinese Academy of Sciences; University of Chinese Academy of Sciences

**Abstract:** Large Language Models (LLMs) often generate untruthful responses even when they possess the correct knowledge. Activating truthfulness within LLMs is crucial for unlocking their full potential. This paper introduces *TruthX*, an inference-time intervention method that enhances LLMs' truthfulness by identifying and editing features within their internal representations. TruthX uses an auto-encoder to map LLMs' representations into separate semantic and truthful latent spaces, and applies contrastive learning to identify a truthful editing direction. During inference, editing LLMs' internal representations in the truthful space effectively improves their truthfulness. Experiments show that TruthX increases the truthfulness of 13 advanced LLMs by an average of 20% on the TruthfulQA benchmark. Further analyses reveal that TruthX can steer LLMs toward either truthful or hallucinatory responses by editing a single vector in their internal representations.

**Introduction:** LLMs have demonstrated remarkable capabilities across natural language processing tasks, yet they sometimes generate untruthful responses, known as "hallucinations," which undermine their credibility. Recent research indicates that LLMs can answer truthfully in some contexts but hallucinate in others, even when they hold the correct knowledge. TruthX addresses this by editing LLMs' internal representations in a truthful space: an auto-encoder decouples each representation into truthful and semantic latent spaces, and contrastive learning is used to probe and edit the representations, enhancing truthfulness without compromising generative capability.

**Related Work:** Recent efforts to enhance LLMs' truthfulness include contrastive decoding and representation editing. Contrastive decoding modifies output probabilities during generation, while representation editing intervenes on internal activations, offering controllability and a lightweight footprint. TruthX differs in that it edits all internal representations rather than only selected attention heads, and it performs the edit within a dedicated truthful space, yielding more effective truthfulness enhancement.

**TruthX:** TruthX first extracts internal representations from the LLM, maps them into truthful and semantic latent spaces with an auto-encoder, and applies contrastive learning to probe a truthful editing direction. During inference, TruthX edits the LLM's internal representations within the truthful space to enhance truthfulness, as sketched below.
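The decoupling, contrastive probing, and inference-time editing steps can be pictured with a minimal PyTorch sketch. All module names, dimensions, and the exact loss form below are illustrative assumptions for exposition, not the authors' implementation:

```python
# Minimal sketch of a TruthX-style editor (assumed names/shapes, not the paper's code).
# Idea: an auto-encoder maps an LLM hidden state into a "semantic" latent and a
# "truthful" latent; a decoder reconstructs the hidden state from both. At inference
# time the truthful latent is shifted along an editing direction before decoding.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TruthXEditor(nn.Module):
    def __init__(self, hidden_dim: int = 4096, latent_dim: int = 1024):
        super().__init__()
        # Two encoders decouple the representation into separate latent spaces.
        self.sem_encoder = nn.Sequential(nn.Linear(hidden_dim, latent_dim), nn.GELU())
        self.truth_encoder = nn.Sequential(nn.Linear(hidden_dim, latent_dim), nn.GELU())
        # The decoder reconstructs a hidden state from the two latents.
        self.decoder = nn.Linear(2 * latent_dim, hidden_dim)
        # Editing direction in the truthful space (a single vector), learned via probing.
        self.delta = nn.Parameter(torch.randn(latent_dim) * 0.01)

    def reconstruct(self, h: torch.Tensor) -> torch.Tensor:
        z_sem, z_truth = self.sem_encoder(h), self.truth_encoder(h)
        return self.decoder(torch.cat([z_sem, z_truth], dim=-1))

    def edit(self, h: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
        """Shift the truthful latent along delta; alpha > 0 pushes toward truthfulness."""
        z_sem = self.sem_encoder(h)
        z_truth = self.truth_encoder(h) + alpha * F.normalize(self.delta, dim=-1)
        return self.decoder(torch.cat([z_sem, z_truth], dim=-1))


def contrastive_probe_loss(editor: TruthXEditor,
                           h_truthful: torch.Tensor,
                           h_hallucinated: torch.Tensor,
                           temperature: float = 0.1) -> torch.Tensor:
    """Toy contrastive objective: in the truthful space, representations of truthful
    answers should align with the editing direction while hallucinated ones should not
    (only the truthful-space term is shown here)."""
    z_pos = F.normalize(editor.truth_encoder(h_truthful), dim=-1)
    z_neg = F.normalize(editor.truth_encoder(h_hallucinated), dim=-1)
    direction = F.normalize(editor.delta, dim=-1)
    logits = torch.stack([z_pos @ direction, z_neg @ direction], dim=-1) / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)
```

In this reading, the reconstruction loss keeps edited states on the LLM's representation manifold, while the contrastive term gives the truthful space a direction along which truthfulness can be adjusted.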
**Experiments:** TruthX is evaluated on the TruthfulQA benchmark and additional benchmarks, showing significant improvements in truthfulness. It outperforms baselines and state-of-the-art methods and generalizes robustly across different LLMs and domains.

**Analyses:** Ablation studies and visualizations confirm the effectiveness of TruthX's components. Editing in the truthful space directly influences truthfulness, while editing in the semantic space mainly alters the semantics of the generated text rather than its truthfulness.
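The finding that a single editing vector can push generation toward either truthful or hallucinatory responses corresponds, in the sketch above, to choosing the sign of the editing strength. The hook-based wiring below is a hypothetical illustration (the layer index and module path are assumptions, and `TruthXEditor` is the class from the previous sketch), not the paper's actual integration:

```python
# Hedged usage sketch: apply the editor inside a forward hook so that a positive
# alpha steers generation toward truthful responses and a negative alpha toward
# hallucinatory ones. TruthXEditor comes from the sketch above; weights are
# assumed to be pre-trained.
import torch

editor = TruthXEditor(hidden_dim=4096, latent_dim=1024)

def make_editing_hook(alpha: float):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        edited = editor.edit(hidden, alpha=alpha)
        return (edited,) + output[1:] if isinstance(output, tuple) else edited
    return hook

# alpha = +1.0 -> edit toward the truthful direction; alpha = -1.0 -> toward hallucination.
# Hypothetical attachment point for a decoder-only LLM (layer choice is illustrative):
# handle = llm.model.layers[20].self_attn.register_forward_hook(make_editing_hook(1.0))
# output = llm.generate(**inputs); handle.remove()
```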