3 Apr 2024 | Jerry Wei, Chengrun Yang, Xinying Song, Yifeng Lu, Nathan Hu, Jie Huang, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang, Cosmo Du, Quoc V. Le
This paper addresses the challenge of evaluating the long-form factuality of responses generated by large language models (LLMs). To benchmark this, the authors use GPT-4 to generate a dataset called LongFact, which consists of 2,280 questions spanning 38 topics. They propose a method called Search-Augmented Factuality Evaluator (SAFE), which uses an LLM to break a long-form response into individual facts and checks each one against Google Search results. Compared with crowdsourced human annotators, SAFE agrees with them 72% of the time and, on a sample of disagreement cases, its judgment prevails 76% of the time, while being more than 20 times cheaper. The authors also introduce a metric, $F_1 @ K$, which combines factual precision and recall, where $K$ is a hyperparameter representing the preferred number of facts in a response. Thirteen LLMs from four families (Gemini, GPT, Claude, and PaLM-2) are benchmarked, with larger models generally achieving better long-form factuality. The paper concludes by discussing limitations and future directions, emphasizing the need for further research on improving LLMs' long-form factuality and reducing hallucination in long-form settings.
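To make the metric concrete, here is a minimal Python sketch of how an $F_1 @ K$-style score can be computed from the supported/unsupported fact counts that SAFE produces. The function name and the assumption that recall saturates once $K$ supported facts are reached are illustrative choices based on the description above, not the paper's reference implementation.

```python
# Sketch of an F1@K-style score: precision is the fraction of a response's
# facts that are supported, recall is the number of supported facts relative
# to a preferred count K (capped at 1.0). Names are illustrative.

def f1_at_k(num_supported: int, num_not_supported: int, k: int) -> float:
    """Combine factual precision and recall-at-K into one score."""
    if num_supported == 0:
        return 0.0  # a response with no supported facts scores zero
    precision = num_supported / (num_supported + num_not_supported)
    recall_at_k = min(num_supported / k, 1.0)  # saturates at K supported facts
    return 2 * precision * recall_at_k / (precision + recall_at_k)

# Example: 40 supported facts, 10 unsupported, preferred length K = 64.
print(round(f1_at_k(40, 10, 64), 3))  # ~0.702
```

Intuitively, the score rewards responses that are both accurate (few unsupported facts) and sufficiently detailed (close to $K$ supported facts), so a terse but fully correct answer and a long but error-filled one are both penalized.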