Lynx: An Open Source Hallucination Evaluation Model

22 Jul 2024 | Selvan Sunitha Ravi, Bartosz Mielczarek, Anand Kannappan, Douwe Kiela, Rebecca Qian
This paper introduces LYNX, an open-source hallucination detection model that outperforms GPT-4o and closed-source LLM-as-a-judge models on a comprehensive hallucination evaluation benchmark called HaluBench. LYNX is trained on a diverse set of real-world datasets, including FinanceBench, DROP, COVID-QA, and PubMedQA, to detect hallucinations in Retrieval-Augmented Generation (RAG) systems. The model is designed to evaluate the faithfulness of generated answers to the provided context, which is critical for the success of RAG systems in production.

HaluBench consists of 15,000 samples sourced from real-world domains, including finance and medicine. It includes both hallucinated and faithful responses to questions, allowing for a comprehensive evaluation of hallucination detection models. The benchmark is the first open-source dataset containing hallucination tasks sourced from real-world domains.
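To make the evaluation task concrete, the sketch below shows how a HaluBench-style example (question, retrieved context, generated answer, and a faithful/hallucinated label) might be represented and turned into an LLM-as-judge prompt. The field names, prompt wording, and PASS/FAIL convention are illustrative assumptions, not the paper's exact format.

```python
# Minimal sketch of a HaluBench-style example and a judge prompt.
# Field names and prompt text are hypothetical, for illustration only.
from dataclasses import dataclass

@dataclass
class HaluBenchExample:
    question: str   # user question posed to the RAG system
    context: str    # retrieved passage the answer must stay faithful to
    answer: str     # model-generated answer to be judged
    label: str      # "PASS" (faithful) or "FAIL" (hallucinated)

def build_judge_prompt(ex: HaluBenchExample) -> str:
    """Assemble a prompt asking a judge model (e.g. LYNX) whether the answer
    is supported by the context, with reasoning followed by a PASS/FAIL verdict."""
    return (
        "Given the question, context, and answer below, decide whether the "
        "answer is faithful to the context.\n"
        f"QUESTION: {ex.question}\n"
        f"CONTEXT: {ex.context}\n"
        f"ANSWER: {ex.answer}\n"
        "First explain your reasoning, then output PASS if the answer is "
        "faithful to the context or FAIL if it is hallucinated."
    )
```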
LYNX is trained to produce both a reasoning chain and an evaluation score, similar to Natural Language Inference (NLI) tasks, which improves the interpretability of the evaluation score. The model is trained on a dataset of 2,400 samples, of which 800 are used for validation, and is fine-tuned from the Llama-3-70B-Instruct and Llama-3-8B-Instruct checkpoints. LYNX outperforms GPT-4o and other closed-source models on HaluBench, with the 70B version achieving 87.4% accuracy; the 8B version also improves on the baseline Llama 3 models. LYNX is the first open-source hallucination detection model to outperform GPT-4o and closed-source LLM-as-a-judge models.

The paper also discusses related work, including the use of RAG systems for knowledge-intensive tasks, the challenges of hallucination detection, and the importance of factuality in AI systems. It highlights the limitations of existing hallucination detection methods and proposes a novel method to generate hard-to-detect hallucination examples from question answering tasks by applying semantic perturbations to LLM responses.

The authors also note the limitations of their work, including the focus on English datasets and the need for further research on multilingual coverage and other NLP domains. They conclude that LYNX provides a valuable tool for evaluating the faithfulness of model responses in reference-free settings, with important implications for business contexts ranging from detecting erroneous responses in financial Q&A to preventing misinformation in medical AI assistants.
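The sketch below illustrates the hard-negative construction idea described above: minimally perturbing a faithful answer so that it remains fluent and plausible but is no longer supported by the context. The perturbation prompt and the `generate` wrapper are hypothetical stand-ins, not the authors' exact implementation.

```python
# Minimal sketch of generating a hard-to-detect hallucinated example via a
# semantic perturbation of a faithful answer. `generate` wraps whatever LLM
# is used for the perturbation; the prompt wording is an assumption.
from typing import Callable

def make_hallucinated_example(question: str, context: str, faithful_answer: str,
                              generate: Callable[[str], str]) -> dict:
    """Ask an LLM to minimally alter a faithful answer so it is no longer
    supported by the context, and label the result as hallucinated."""
    perturbation_prompt = (
        "Rewrite the answer below so that it remains fluent and plausible for "
        "the question, but is no longer supported by the context (for example, "
        "change a number, date, or entity, or negate a claim).\n"
        f"QUESTION: {question}\n"
        f"CONTEXT: {context}\n"
        f"ANSWER: {faithful_answer}\n"
        "Perturbed answer:"
    )
    hallucinated_answer = generate(perturbation_prompt)
    return {
        "question": question,
        "context": context,
        "answer": hallucinated_answer,
        "label": "FAIL",  # unfaithful by construction
    }
```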