3 Apr 2024 | Jerry Wei, Chengrun Yang, Xinying Song, Yifeng Lu, Nathan Hu, Jie Huang, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang, Cosmo Du, Quoc V. Le
This paper introduces LongFact, a new benchmark for evaluating long-form factuality in large language models (LLMs), and SAFE (Search-Augmented Factuality Evaluator), an automated method that uses LLMs to assess the accuracy of long-form responses. LongFact consists of 2,280 prompts across 38 topics, designed to elicit long-form responses that require factual accuracy. SAFE breaks a long-form response into individual facts and evaluates each one through a multi-step reasoning process that issues Google Search queries and checks whether the results support the fact. SAFE agrees with human annotations 72% of the time, its verdict is correct in 76% of a sample of disagreement cases, and it is more than 20 times cheaper than human annotation.
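As a rough illustration of the pipeline described above, the sketch below wires the three stages (fact splitting, relevance filtering, search-backed rating) together. It is not the authors' implementation: `call_llm` and `search` are hypothetical callables the caller must supply (an LLM API wrapper and a search API wrapper), and the prompts are placeholders rather than the paper's actual few-shot prompts.

```python
from typing import Callable, Dict, List

def safe_style_eval(
    question: str,
    response: str,
    call_llm: Callable[[str], str],      # hypothetical LLM wrapper supplied by the caller
    search: Callable[[str], List[str]],  # hypothetical Google Search wrapper supplied by the caller
    max_search_steps: int = 5,
) -> Dict[str, str]:
    """Label each fact in `response` as supported / not supported / irrelevant."""
    # Stage 1: split the response into individual, self-contained facts.
    facts = call_llm(
        "List every individual fact stated in the response below, one per line.\n\n"
        f"Response:\n{response}"
    ).splitlines()

    labels: Dict[str, str] = {}
    for fact in (f.strip() for f in facts):
        if not fact:
            continue
        # Stage 2: discard facts that are not relevant to the original prompt.
        relevance = call_llm(
            f"Question: {question}\nFact: {fact}\n"
            "Is this fact relevant to answering the question? Answer yes or no."
        )
        if not relevance.strip().lower().startswith("yes"):
            labels[fact] = "irrelevant"
            continue
        # Stage 3: multi-step search-and-reason loop; the model proposes queries
        # and the accumulated results become the evidence for the final rating.
        evidence: List[str] = []
        for _ in range(max_search_steps):
            query = call_llm(
                f"Fact to verify: {fact}\nEvidence so far: {evidence}\n"
                "Propose one Google Search query that would help verify the fact."
            )
            evidence.extend(search(query))
        verdict = call_llm(
            f"Fact: {fact}\nSearch results: {evidence}\n"
            "Based only on these results, answer 'supported' or 'not supported'."
        )
        labels[fact] = (
            "supported" if verdict.strip().lower().startswith("supported") else "not supported"
        )
    return labels
```

The sketch only mirrors the control flow; the actual system relies on carefully engineered prompts and lets the model decide when to stop searching.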
The paper also proposes F1@K as a new metric for long-form factuality, combining precision (the percentage of a response's facts that are supported) with recall (the number of supported facts provided relative to K, a hyperparameter representing the user's preferred response length). The authors benchmark thirteen language models across four families (Gemini, GPT, Claude, and PaLM-2) on LongFact, finding that larger models generally achieve better long-form factuality, and that SAFE serves as a reliable, cost-effective substitute for human annotation in this setting.
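Concretely, with S supported facts out of N relevant facts in a response, precision is S/N, recall is min(S/K, 1), and F1@K is their harmonic mean (zero when no facts are supported). A minimal Python sketch of that computation (function and argument names are illustrative, not from the paper's code):

```python
def f1_at_k(num_supported: int, num_not_supported: int, k: int) -> float:
    """Harmonic mean of factual precision and recall@K for one response."""
    total = num_supported + num_not_supported    # irrelevant facts are excluded
    if num_supported == 0 or total == 0:
        return 0.0                               # responses with no supported facts score 0
    precision = num_supported / total            # fraction of provided facts that are supported
    recall = min(num_supported / k, 1.0)         # saturates once K supported facts are provided
    return 2 * precision * recall / (precision + recall)

# e.g. 80 supported and 20 unsupported facts at K = 64:
# precision = 0.8, recall = 1.0, so F1@64 ≈ 0.89
```

Because recall saturates at K, the metric rewards responses that are both accurate and sufficiently detailed, without giving unbounded credit for ever-longer answers.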
The paper also discusses the limitations of LongFact and SAFE, including their reliance on LLMs and the potential for errors in fact verification. The authors suggest that future research could improve verification accuracy by incorporating additional external tools and refining the use of search-augmented language models, and they highlight measuring factual recall in long-form settings as a direction worth extending to other domains. Overall, the paper presents SAFE as a promising, scalable tool for evaluating long-form factuality in LLMs.