OpenFactCheck: A Unified Framework for Factuality Evaluation of LLMs


9 May 2024 | Yuxia Wang, Minghan Wang, Hasan Iqbal, Georgi Georgiev, Jiahui Geng, Preslav Nakov
OpenFactCheck is a unified framework for evaluating the factuality of large language models (LLMs). It consists of three modules: CUSTCHECKER, which allows users to customize an automatic fact-checker and verify the factual accuracy of documents and claims; LLMEVAL, which assesses LLM factuality from multiple perspectives; and CHECKEREVAL, which evaluates the reliability of automatic fact-checkers against human-annotated data. The framework is publicly available at https://github.com/yuxiaw/OpenFactCheck.

LLMs, while powerful, often produce content that deviates from real-world facts, degrading system performance and reliability. Existing studies of LLM factuality face two main challenges: the difficulty of assessing open-domain, free-form responses and the lack of standardized benchmarks. OpenFactCheck addresses these issues by providing a unified, extensible, and customizable framework for factuality evaluation.

CUSTCHECKER lets users select and customize fact-checking components, such as claim processors, retrievers, and verifiers, to evaluate the factual accuracy of documents and claims. LLMEVAL uses seven factuality-specific benchmarks to evaluate LLMs from multiple angles and generates reports that highlight weaknesses and suggest improvements. CHECKEREVAL measures the accuracy of automatic fact-checkers and maintains a leaderboard to encourage the development of more effective systems.
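How these components fit together can be illustrated with a short, self-contained sketch. The names below (run_pipeline, Verdict, and the toy components) are illustrative assumptions rather than the actual OpenFactCheck API; the sketch only shows how a claim processor, retriever, and verifier can be composed and swapped independently.

```python
# Minimal, self-contained sketch of a claim-processor -> retriever -> verifier
# pipeline. All names here (run_pipeline, Verdict, the toy components) are
# illustrative assumptions, not the actual OpenFactCheck API.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    claim: str
    evidence: list[str]
    label: str  # "supported", "refuted", or "not enough evidence"


def simple_claim_processor(document: str) -> list[str]:
    """Claim processor: naively split a document into sentence-level claims."""
    return [s.strip() for s in document.split(".") if s.strip()]


def keyword_retriever(claim: str, corpus: list[str]) -> list[str]:
    """Retriever: return corpus passages sharing at least one keyword with the claim."""
    keywords = {w.lower() for w in claim.split() if len(w) > 3}
    return [p for p in corpus if keywords & {w.lower() for w in p.split()}]


def overlap_verifier(claim: str, evidence: list[str]) -> str:
    """Verifier: 'supported' if any retrieved passage contains the claim verbatim."""
    if not evidence:
        return "not enough evidence"
    return "supported" if any(claim.lower() in p.lower() for p in evidence) else "refuted"


def run_pipeline(
    document: str,
    corpus: list[str],
    claim_processor: Callable[[str], list[str]] = simple_claim_processor,
    retriever: Callable[[str, list[str]], list[str]] = keyword_retriever,
    verifier: Callable[[str, list[str]], str] = overlap_verifier,
) -> list[Verdict]:
    """Decompose a document into claims, retrieve evidence, and verify each claim."""
    verdicts = []
    for claim in claim_processor(document):
        evidence = retriever(claim, corpus)
        verdicts.append(Verdict(claim, evidence, verifier(claim, evidence)))
    return verdicts


if __name__ == "__main__":
    corpus = ["The Eiffel Tower is located in Paris", "Water boils at 100 degrees Celsius"]
    response = "The Eiffel Tower is located in Paris. The Moon is made of cheese."
    for v in run_pipeline(response, corpus):
        print(f"{v.label:>20} | {v.claim}")
```

Because each stage is passed in as a parameter, swapping the naive keyword retriever for a web search tool, or the overlap verifier for an LLM-based verifier, does not require changing the surrounding pipeline.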
The three modules reinforce one another: human verification results collected for LLMEVAL can serve as benchmarks for evaluating automatic fact-checkers, while the best-performing checker identified by CHECKEREVAL can be deployed for automatic verification. The framework supports flexible configurations and integrates existing methods and datasets.

OpenFactCheck thus provides a standardized platform for users, developers, and researchers. It enables users to verify the factual accuracy of claims and documents, allows LLM developers to evaluate their models under consistent metrics, and helps researchers assess the reliability of fact-checkers against fine-grained benchmarks.

The framework is implemented in Python with a web interface and a database, and is deployed on AWS. It supports customizable and extensible configurations, allowing users to develop their own task solvers or integrate existing ones, and it is designed to be compatible with a variety of fact-checking systems and datasets.
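As a rough illustration of this plug-in design, the sketch below registers a user-defined claim processor under a string name and selects it through a configuration dictionary. The registry, decorator, and configuration keys are assumptions made for illustration; the actual extension interface is the one defined in the OpenFactCheck repository.

```python
# Illustrative sketch of registering a user-defined task solver and selecting it
# through a configuration. The registry, decorator, and config keys below are
# assumptions for illustration, not the actual OpenFactCheck extension API.
from typing import Callable, Dict

SOLVER_REGISTRY: Dict[str, Callable] = {}


def register_solver(name: str) -> Callable[[Callable], Callable]:
    """Decorator that records a solver under a string name for config lookup."""
    def decorator(fn: Callable) -> Callable:
        SOLVER_REGISTRY[name] = fn
        return fn
    return decorator


@register_solver("declarative_claim_processor")
def declarative_claim_processor(response: str) -> list[str]:
    """Custom claim processor: keep declarative sentences, drop questions."""
    marked = response.replace("?", "?|").replace(".", ".|")
    sentences = [s.strip() for s in marked.split("|")]
    return [s.rstrip(".") for s in sentences if s and not s.endswith("?")]


# A user-supplied configuration naming which registered solvers form the pipeline.
config = {
    "claim_processor": "declarative_claim_processor",
    "retriever": "keyword_retriever",  # assumed to be registered elsewhere
    "verifier": "overlap_verifier",    # assumed to be registered elsewhere
}

if __name__ == "__main__":
    processor = SOLVER_REGISTRY[config["claim_processor"]]
    print(processor("Is the Earth flat? The Earth orbits the Sun."))
    # -> ['The Earth orbits the Sun']
```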
Experiments show that LLMs often produce factually correct responses to open-domain questions but struggle with certain types of factual errors, such as snowballing hallucinations and over-commitment to false premises. Automatic fact-checking systems, in turn, have difficulty identifying false claims accurately, particularly when retrieving relevant evidence, and their accuracy depends on the choice of search tools, prompts, and backend LLMs.

Overall, OpenFactCheck provides a comprehensive framework for evaluating the factuality of LLMs and supports the development of more reliable and accurate fact-checking systems. It encourages the integration of new techniques, features, and evaluation benchmarks to advance research in LLM factuality evaluation.