3 Apr 2024 | Jerry Wei, Chengrun Yang, Xinying Song, Yifeng Lu, Nathan Hu, Jie Huang, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang, Cosmo Du, Quoc V. Le
This paper addresses the challenge of evaluating the long-form factuality of responses generated by large language models (LLMs). To benchmark this, the authors use GPT-4 to generate a dataset called LongFact, which consists of 2,280 questions spanning 38 topics. They propose a method called Search-Augmented Factuality Evaluator (SAFE), which uses an LLM to break a long-form response into individual facts and checks each one against Google Search results. Compared with crowdsourced human annotators, SAFE agrees with them 72% of the time and, on a sample of disagreement cases, its judgment prevails 76% of the time, while being more than 20 times cheaper. The authors also introduce a metric, $F_1 @ K$, which combines factual precision and recall, where $K$ is a hyperparameter representing the preferred number of facts in a response. Thirteen LLMs from four families (Gemini, GPT, Claude, and PaLM-2) are benchmarked, with larger models generally achieving better long-form factuality. The paper concludes by discussing limitations and future directions, emphasizing the need for further research on improving LLMs' long-form factuality and reducing hallucination in long-form settings.
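To make the metric concrete, here is a minimal Python sketch of how an $F_1 @ K$-style score can be computed from the supported/unsupported fact counts that SAFE produces. The function name and the assumption that recall saturates once $K$ supported facts are reached are illustrative choices based on the description above, not the paper's reference implementation.

```python
# Sketch of an F1@K-style score: precision is the fraction of a response's
# facts that are supported, recall is the number of supported facts relative
# to a preferred count K (capped at 1.0). Names are illustrative.

def f1_at_k(num_supported: int, num_not_supported: int, k: int) -> float:
    """Combine factual precision and recall-at-K into one score."""
    if num_supported == 0:
        return 0.0  # a response with no supported facts scores zero
    precision = num_supported / (num_supported + num_not_supported)
    recall_at_k = min(num_supported / k, 1.0)  # saturates at K supported facts
    return 2 * precision * recall_at_k / (precision + recall_at_k)

# Example: 40 supported facts, 10 unsupported, preferred length K = 64.
print(round(f1_at_k(40, 10, 64), 3))  # ~0.702
```

Intuitively, the score rewards responses that are both accurate (few unsupported facts) and sufficiently detailed (close to $K$ supported facts), so a terse but fully correct answer and a long but error-filled one are both penalized.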