TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space

5 Jun 2024 | Shaolei Zhang, Tian Yu, Yang Feng
TruthX is an approach to enhancing the truthfulness of large language models (LLMs) by editing their internal representations in a truthful space. An auto-encoder maps LLM representations into separate semantic and truthful latent spaces, and contrastive learning is then used to identify a truthful editing direction. During inference, TruthX edits the LLM's internal representations along this direction in the truthful space to enhance truthfulness.

Experiments show that TruthX improves the truthfulness of 13 advanced LLMs by an average of 20% on the TruthfulQA benchmark. Further analyses suggest that TruthX can control whether an LLM produces truthful or hallucinatory responses by editing only one vector in its internal representations: editing along the truthful direction enhances truthfulness, while editing along the opposite direction yields hallucinatory responses. The truthful spaces extracted from homologous LLMs are highly similar, so a well-trained TruthX can be reused across such models, and layer-wise analysis indicates that the middle layers of LLMs correlate most strongly with truthfulness.

TruthX outperforms existing methods such as contrastive decoding and representation editing, achieving significant improvements in truthfulness without compromising generative capabilities. It is effective across various LLMs and benchmarks, demonstrating strong generalization and versatility, and offers a promising way to improve the reliability of LLMs by enhancing their truthfulness.
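To make the editing step concrete, the sketch below illustrates the general idea of shifting an LLM hidden state along a single learned direction in a truthful latent space. This is a minimal illustration, not the authors' released implementation: the linear encoder/decoder, the names (`TruthfulEditor`, `truthful_direction`, `strength`), and the dimensions are assumptions made for the example.

```python
import torch
import torch.nn as nn


class TruthfulEditor(nn.Module):
    """Minimal sketch of editing a hidden state in a learned truthful space.

    Hypothetical module; the real TruthX auto-encoder and its training
    objective (contrastive learning over truthful vs. hallucinatory
    representations) are not reproduced here.
    """

    def __init__(self, hidden_size: int, latent_size: int):
        super().__init__()
        # Encoders map an LLM hidden state into a "semantic" latent
        # (content) and a "truthful" latent (truthfulness-related features).
        self.semantic_enc = nn.Linear(hidden_size, latent_size)
        self.truthful_enc = nn.Linear(hidden_size, latent_size)
        # A decoder maps latents back to the LLM's hidden-state space.
        self.decoder = nn.Linear(latent_size, hidden_size)
        # A single editing direction in the truthful space; in TruthX this
        # would be identified via contrastive learning, here it is random.
        self.truthful_direction = nn.Parameter(torch.randn(latent_size))

    def edit(self, hidden: torch.Tensor, strength: float = 1.0) -> torch.Tensor:
        """Shift the hidden state along the truthful direction.

        A positive `strength` nudges toward truthfulness; a negative value
        probes the opposite (hallucinatory) direction.
        """
        z_sem = self.semantic_enc(hidden)
        z_truth = self.truthful_enc(hidden)
        direction = self.truthful_direction / self.truthful_direction.norm()
        z_truth_edited = z_truth + strength * direction
        # Reconstruct from the unchanged semantic latent plus the edited
        # truthful latent, then apply the change as a residual update.
        original = self.decoder(z_sem + z_truth)
        edited = self.decoder(z_sem + z_truth_edited)
        return hidden + (edited - original)


if __name__ == "__main__":
    # Toy usage: edit one activation from a (hypothetical) middle layer.
    editor = TruthfulEditor(hidden_size=4096, latent_size=1024)
    hidden_state = torch.randn(1, 4096)  # stand-in for an LLM hidden state
    edited = editor.edit(hidden_state, strength=1.0)
    print(edited.shape)  # torch.Size([1, 4096])
```

In the paper's framing, only the truthful latent is edited while the semantic latent is left untouched, which is why truthfulness can be steered without degrading the model's generative capabilities; the residual-update form above is one simple way to express that idea.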