12 Aug 2024 | Abhika Mishra, Akari Asai, Vidhisha Balachandran, Yizhong Wang, Yulia Tsvetkov, Graham Neubig, Hannaneh Hajishirzi
This paper addresses the issue of hallucinations, or factual errors, in large language model (LM) outputs. The authors propose a comprehensive taxonomy of hallucinations and introduce a novel task of automatic fine-grained hallucination detection. They construct the FAVABENCH benchmark, which includes about 1,000 fine-grained human judgments on outputs from three LMs across various domains. The analysis reveals that ChatGPT and Llama2-Chat (70B, 7B) exhibit diverse types of hallucinations in the majority of their outputs, highlighting the need for fine-grained detection systems. To address this, the authors train FAVA, a retrieval-augmented LM, on synthetic data to detect and correct fine-grained hallucinations. FAVA significantly outperforms retrieval-augmented ChatGPT and GPT-4 on fine-grained hallucination detection and editing, and its suggested edits improve the factuality score of Alpaca 7B, Alpaca 13B, and ChatGPT outputs by 4.4%, 9.3%, and 3.3%, respectively. The paper also discusses challenges and future directions in fine-grained hallucination detection and editing.
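As a rough illustration of what a fine-grained detect-and-edit output could look like, the sketch below parses a span-annotated LM response into per-type error detections and a corrected string. The tag names (entity, relation, contradictory, invented, subjective, unverifiable) and the `<mark>`/`<delete>` markup are assumptions chosen for this example and may not match FAVA's exact output format.

```python
import re

# Illustrative error-type tags for a fine-grained taxonomy.
# NOTE: these names and the <mark>/<delete> markup are assumptions for this
# sketch, not necessarily FAVA's actual annotation scheme.
ERROR_TAGS = ["entity", "relation", "contradictory", "invented", "subjective", "unverifiable"]

TAG_RE = re.compile(
    r"<(?P<tag>{tags})>(?P<span>.*?)</(?P=tag)>".format(tags="|".join(ERROR_TAGS)),
    re.DOTALL,
)
EDIT_RE = re.compile(r"<mark>(?P<insert>.*?)</mark>|<delete>(?P<remove>.*?)</delete>", re.DOTALL)


def detect_and_edit(annotated: str):
    """Collect fine-grained error spans and produce a corrected string."""
    detections = []

    def apply_edits(match: re.Match) -> str:
        tag, span = match.group("tag"), match.group("span")
        detections.append((tag, span))
        # Keep <mark> insertions, drop <delete> spans, leave the rest untouched.
        return EDIT_RE.sub(lambda m: m.group("insert") or "", span)

    edited = TAG_RE.sub(apply_edits, annotated)
    return detections, edited


if __name__ == "__main__":
    example = (
        "The Eiffel Tower is located in "
        "<entity><delete>Berlin</delete><mark>Paris</mark></entity>."
    )
    errors, fixed = detect_and_edit(example)
    print(errors)  # [('entity', '<delete>Berlin</delete><mark>Paris</mark>')]
    print(fixed)   # The Eiffel Tower is located in Paris.
```

In a full pipeline, the annotated string would come from a retrieval-augmented model conditioned on the original response plus retrieved evidence; the edited string is what would be scored for factuality improvements.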