27 Jun 2024 | Yixiao Song, Yekyung Kim, Mohit Iyyer
VERISCORE is a new metric for evaluating the factuality of long-form text generation that focuses on verifiable claims. Existing metrics such as FACTSCORE and SAFE assume every claim a model produces is verifiable, but this assumption breaks down in more complex generation tasks. VERISCORE addresses the problem by extracting only the verifiable claims from a response and checking them against Google Search results. It is implemented with both closed and open-weight language models, and human evaluations show it extracts more sensible claims than existing methods.

Applying VERISCORE to 16 models across multiple long-form tasks, the authors find that GPT-4o performs best, while open-weight models such as Mixtral 8x22B are closing the gap. The results also show that factuality varies across tasks and domains, that complex claims are harder to verify, and that factuality evaluation benefits from a diverse set of tasks.

The metric is open-sourced in both its closed and open-weight implementations. The paper also discusses limitations, including the difficulty of verifying complex claims and the need for more sophisticated verification methods. Overall, VERISCORE provides a more accurate and comprehensive evaluation of factuality in long-form text generation.
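The two-stage pipeline described above (extract only the verifiable claims, then verify each one against search results) can be sketched roughly as follows. This is a hypothetical simplification, not the authors' implementation: the real system prompts an LLM for claim extraction and queries Google Search for evidence, whereas here both stages are stubbed with toy lookups, and the score is taken as the plain fraction of supported claims, which may differ from the paper's exact aggregation.

```python
# Hypothetical sketch of a VERISCORE-style factuality pipeline.
# Assumptions (not from the paper): `extract_claims` and the evidence
# set are toy stand-ins for an LLM claim extractor and Google Search
# results; the score is the fraction of extracted claims supported.

def veriscore(sentences, extract_claims, is_supported):
    """Extract verifiable claims from each sentence, verify each one,
    and return (score, n_claims). Sentences yielding no verifiable
    claim contribute nothing to the denominator."""
    claims = [c for s in sentences for c in extract_claims(s)]
    if not claims:                      # nothing verifiable to score
        return 0.0, 0
    supported = sum(1 for c in claims if is_supported(c))
    return supported / len(claims), len(claims)


# --- toy demo: opinions and advice yield no verifiable claims -------
CLAIMS = {
    "Paris is the capital of France, and it is lovely.":
        ["Paris is the capital of France."],
    "You should definitely visit in spring.": [],  # advice: unverifiable
    "The Eiffel Tower was completed in 1850.":
        ["The Eiffel Tower was completed in 1850."],
}
EVIDENCE = {"Paris is the capital of France."}  # stand-in for search hits

score, n = veriscore(
    list(CLAIMS),
    lambda s: CLAIMS.get(s, []),
    lambda c: c in EVIDENCE,
)
print(score, n)  # 0.5 2 -> one of the two extracted claims is supported
```

Note that an unverifiable sentence simply shrinks the claim set rather than counting as an error, which is the key difference from metrics that assume every sentence carries a checkable claim.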