MEASURING AND REDUCING LLM HALLUCINATION WITHOUT GOLD-STANDARD ANSWERS

6 Jun 2024 | Jiaheng Wei, Yuanshun Yao, Jean-Francois Ton, Hongyi Guo, Andrew Estornell, Yang Liu
The paper addresses hallucination in large language models (LLMs): the generation of factually incorrect but seemingly convincing answers. Existing hallucination metrics require gold-standard answers, which are costly to collect and prone to human error. To overcome this, the authors propose Factualness Evaluations via Weighing LLMs (FEWL), a metric designed for settings without gold-standard answers.

FEWL uses off-the-shelf LLMs as proxies for gold-standard answers and quantifies each reference LLM's expertise by generating wrong answers and measuring how strongly the reference LLM disagrees with them. The key challenge is weighting each reference LLM's expertise, which is done by assessing how likely the LLM is to disagree with wrong answers and how superficial its knowledge is. FEWL is theoretically guaranteed to select the best-performing LLM as if gold-standard answers were available.

Empirical results on Truthful-QA, CHALE, and HaluEval show that FEWL measures hallucination more accurately than naive baselines and can be used to reduce hallucination through in-context learning and supervised fine-tuning. Because the approach is far cheaper than collecting human annotations, it offers a practical way to measure and reduce hallucination in LLMs.
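To make the weighting idea concrete, below is a minimal Python sketch, not the paper's exact formulas: the function names, the toy agreement functions, and the simple "one minus mean agreement with wrong answers" weighting are illustrative assumptions, and FEWL's actual score additionally penalizes superficial knowledge. The sketch only shows the core mechanism: a reference LLM that agrees less with generated wrong answers earns a higher weight, and the candidate answer is scored by the weighted agreement of the reference LLMs.

```python
# Illustrative sketch of a FEWL-style score (assumed helper names, not the paper's API).
from typing import Callable, List

# (question, answer) -> agreement in [0, 1], standing in for a reference LLM's judgment.
AgreementFn = Callable[[str, str], float]

def expertise_weight(agree: AgreementFn, question: str, wrong_answers: List[str]) -> float:
    """A reference LLM earns more weight the more it disagrees with known-wrong answers."""
    if not wrong_answers:
        return 1.0
    mean_agreement_with_wrong = sum(agree(question, w) for w in wrong_answers) / len(wrong_answers)
    return 1.0 - mean_agreement_with_wrong  # high disagreement with wrong answers -> high expertise

def fewl_style_score(
    question: str,
    candidate_answer: str,
    reference_agreements: List[AgreementFn],
    wrong_answers: List[str],
) -> float:
    """Weighted agreement of the reference LLMs with the candidate answer (no gold answer used)."""
    weights = [expertise_weight(a, question, wrong_answers) for a in reference_agreements]
    total = sum(weights) or 1.0
    return sum(w * a(question, candidate_answer)
               for w, a in zip(weights, reference_agreements)) / total

if __name__ == "__main__":
    # Toy agreement functions standing in for real reference LLMs.
    expert = lambda q, a: 0.9 if "Canberra" in a else 0.1   # knows the answer
    novice = lambda q, a: 0.5                               # agrees with everything -> low weight
    question = "What is the capital of Australia?"
    wrong = ["Sydney", "Melbourne"]                         # generated wrong answers
    print(fewl_style_score(question, "Canberra", [expert, novice], wrong))
```

In this toy run, the expert's weight (0.9) dominates the novice's (0.5), so the final score is pulled toward the judgment of the more reliable reference model, which is the intuition behind weighing reference LLMs by their disagreement with wrong answers.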