Measuring and Reducing LLM Hallucination without Gold-Standard Answers


6 Jun 2024 | Jiaheng Wei, Yuanshun Yao, Jean-Francois Ton, Hongyi Guo, Andrew Estornell, Yang Liu
This paper introduces FEWL, a hallucination metric that measures the factualness of large language model (LLM) answers without requiring gold-standard answers. FEWL uses off-the-shelf LLMs as proxies for gold-standard answers and weights each reference LLM by its quantified expertise on the question at hand. The key idea is to measure each reference LLM's expertise by evaluating its responses against intentionally wrong answers and by probing whether it possesses genuine, expert-level knowledge about the question; a laziness penalty further discounts reference LLMs whose knowledge is superficial or irrelevant. Theoretical analysis shows that FEWL comes with guarantees and is more accurate than naive baselines. Empirical results on the Truthful-QA, CHALE, and HaluEval benchmarks demonstrate that FEWL measures hallucination effectively and can be used to reduce it through in-context learning and supervised fine-tuning. FEWL is significantly cheaper than collecting human annotations and more efficient than existing methods. The paper also shows that FEWL can rank LLMs by how much they hallucinate and can guide training to mitigate hallucination. Overall, FEWL offers a practical and effective way to measure and reduce LLM hallucination when gold-standard answers are unavailable.
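
To make the expertise-weighting idea concrete, below is a minimal Python sketch of a FEWL-style score. It is an illustration under simplified assumptions, not the paper's exact formulation: the token-overlap agreement function, the expertise weight, and the laziness penalty are toy stand-ins for FEWL's definitions, and the reference LLMs are stubbed with canned answers.

```python
"""Minimal sketch of a FEWL-style, expertise-weighted factualness score.
All scoring functions here are simplified stand-ins, not the paper's formulas."""

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ReferenceLLM:
    name: str
    # answer(question) -> this off-the-shelf LLM's reference answer
    answer: Callable[[str], str]


def agreement(a: str, b: str) -> float:
    """Toy agreement in [0, 1]: Jaccard token overlap. A real implementation
    would use an LLM judge or a similarity/NLI model instead."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def expertise_weight(ref: ReferenceLLM, question: str, wrong_answers: List[str]) -> float:
    """Expertise proxy: a reference LLM whose answer resembles the intentionally
    wrong answers gets a low weight. Assumes wrong_answers is non-empty."""
    ref_answer = ref.answer(question)
    agree_with_wrong = sum(agreement(ref_answer, w) for w in wrong_answers) / len(wrong_answers)
    return 1.0 - agree_with_wrong


def laziness_penalty(ref_answer: str, question: str) -> float:
    """Toy penalty for superficial answers: replies that mostly echo the
    question or are very short are discounted."""
    echo = agreement(ref_answer, question)
    too_short = 1.0 if len(ref_answer.split()) < 3 else 0.0
    return 0.5 * echo + 0.5 * too_short


def fewl_style_score(
    question: str,
    candidate_answer: str,
    references: List[ReferenceLLM],
    wrong_answers: List[str],
) -> float:
    """Expertise-weighted agreement between the candidate answer and the
    reference LLMs' answers, with each reference's weight reduced by its
    laziness penalty."""
    total, weight_sum = 0.0, 0.0
    for ref in references:
        ref_answer = ref.answer(question)
        weight = expertise_weight(ref, question, wrong_answers)
        weight = max(0.0, weight - laziness_penalty(ref_answer, question))
        total += weight * agreement(candidate_answer, ref_answer)
        weight_sum += weight
    return total / weight_sum if weight_sum > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical reference LLMs stubbed with canned answers for illustration.
    refs = [
        ReferenceLLM("ref-a", lambda q: "The capital of Australia is Canberra."),
        ReferenceLLM("ref-b", lambda q: "Sydney is the capital of Australia."),
    ]
    question = "What is the capital of Australia?"
    wrong = ["Sydney is the capital of Australia.", "Melbourne is the capital."]
    print(fewl_style_score(question, "Canberra is the capital of Australia.", refs, wrong))
    print(fewl_style_score(question, "The capital is Sydney.", refs, wrong))
```

In this toy setup, the reference LLM that echoes an intentionally wrong answer receives a small weight, so a candidate answer that agrees with the better-informed reference ends up with the higher score, without any gold-standard answer being consulted.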