9 May 2024 | Yuxia Wang, Minghan Wang, Hasan Iqbal, Georgi Georgiev, Jiahui Geng, Preslav Nakov
The paper introduces OpenFactCheck, a unified framework for evaluating the factuality of outputs from large language models (LLMs). The framework consists of three main modules: CUSTCHECKER, LLMEVAL, and CHECKEREVAL. CUSTCHECKER allows users to customize an automatic fact-checker to verify the factual correctness of documents and claims. LLMEVAL is a unified evaluation framework that assesses the factuality of LLMs from multiple perspectives and generates reports highlighting weaknesses and offering advice for improvement. CHECKEREVAL is an extensible solution for evaluating the reliability of automatic fact-checkers' verification results against human-annotated datasets. The framework aims to address the challenges of assessing the factuality of open-domain responses and the lack of standardized evaluation benchmarks. OpenFactCheck is publicly available and designed to facilitate research and development in LLM factuality evaluation. The paper also includes experimental results and discussions on the performance of LLMs and fact-checking systems.
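To make the three-module design concrete, the following is a minimal sketch of how such a pipeline might compose: a customizable fact-checker verifies claims, an LLM evaluator aggregates those verdicts into a factuality score, and a checker evaluator scores the fact-checker itself against gold labels. All class names, method signatures, and the stubbed verification logic are illustrative assumptions, not the framework's actual API.

```python
# Hypothetical sketch of a three-module factuality-evaluation pipeline
# (names and logic are assumptions, not OpenFactCheck's real API).
from dataclasses import dataclass
from typing import Optional


@dataclass
class Claim:
    text: str
    label: Optional[bool] = None  # True = supported, False = refuted


class CustChecker:
    """Customizable fact-checker: splits a document into claims and
    verifies each one (verification is stubbed out here)."""

    def check(self, document: str) -> list[Claim]:
        # Naive claim extraction: one claim per sentence.
        sentences = [s.strip() for s in document.split(".") if s.strip()]
        # A real checker would retrieve evidence and classify each claim;
        # this stub marks every claim as supported.
        return [Claim(text=s, label=True) for s in sentences]


class LLMEval:
    """Evaluates an LLM's factuality by fact-checking its responses
    and reporting the fraction of supported claims."""

    def __init__(self, checker: CustChecker):
        self.checker = checker

    def evaluate(self, responses: list[str]) -> float:
        claims = [c for r in responses for c in self.checker.check(r)]
        if not claims:
            return 0.0
        return sum(bool(c.label) for c in claims) / len(claims)


class CheckerEval:
    """Scores a fact-checker's verdicts against human-annotated labels."""

    def evaluate(self, predicted: list[bool], gold: list[bool]) -> float:
        correct = sum(p == g for p, g in zip(predicted, gold))
        return correct / len(gold)


checker = CustChecker()
factuality = LLMEval(checker).evaluate(["Paris is the capital of France."])
agreement = CheckerEval().evaluate([True, False, True], [True, True, True])
```

The separation mirrors the paper's structure: the checker is a pluggable component, the LLM evaluator consumes its verdicts, and the checker evaluator closes the loop by validating the checker itself.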