**Title:** Lynx: An Open Source Hallucination Evaluation Model
**Authors:** Selvan Sunitha Ravi, Bartosz Mielczarek, Anand Kannappan, Douwe Kiela, Rebecca Qian
**Institutional Affiliations:** Patronus AI, Contextual AI, Stanford University
**Abstract:**
Retrieval Augmented Generation (RAG) techniques aim to mitigate hallucinations in Large Language Models (LLMs). However, LLMs can still produce unsupported or contradictory information. This paper introduces LYNX, a state-of-the-art hallucination detection LLM capable of advanced reasoning on challenging real-world scenarios. To evaluate LYNX, the authors present HaluBench, a comprehensive hallucination evaluation benchmark consisting of 15k samples from various real-world domains. The results show that LYNX outperforms GPT-4o, Claude-3-Sonnet, and other closed and open-source LLM-as-judge models on HaluBench. The authors release LYNX, HaluBench, and their evaluation code for public access.
**Introduction:**
Large Language Models (LLMs) often produce hallucinations, leading to the development of RAG systems to mitigate this issue. However, these systems are still prone to generating inconsistent text. The paper proposes LYNX, an open-source hallucination detection model that outperforms GPT-4o and closed-source LLMs. LYNX is trained using a combination of existing QA datasets and synthetic data perturbations to generate challenging hallucination examples. The authors construct HaluBench, a large-scale hallucination evaluation benchmark with 15k samples, including both hallucinated and faithful responses across multiple real-world domains.
**Contributions:**
- Introduction of HaluBench, a comprehensive hallucination evaluation benchmark.
- Training of LYNX, the first open-source LLM capable of high-quality, reference-free hallucination detection.
- Development of a novel method to generate hard-to-detect hallucination examples from QA tasks (illustrated in the sketch after this list).
- Benchmarking LYNX against closed and open-source LLMs and RAG evaluation metrics.
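
The perturbation idea behind the third contribution can be illustrated with a short sketch. This is a hedged illustration, not the authors' exact pipeline: it assumes an OpenAI-compatible chat client, and the `perturb_answer` helper and prompt wording are hypothetical. The intent is to minimally alter a gold answer so it stays fluent and plausible but is no longer supported by the context.

```python
# Hypothetical sketch: create a hard-to-detect hallucinated answer from a QA
# example by semantically perturbing the gold answer. Prompt wording and the
# helper are illustrative, not the authors' released pipeline.
from openai import OpenAI

client = OpenAI()

PERTURB_PROMPT = """Given a question, a context passage, and a correct answer,
rewrite the answer with a minimal change so that it is no longer supported
by the context, while keeping it fluent and plausible.

Question: {question}
Context: {context}
Correct answer: {answer}
Perturbed answer:"""


def perturb_answer(question: str, context: str, answer: str) -> str:
    """Return a plausible but unsupported (hallucinated) answer variant."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": PERTURB_PROMPT.format(
                question=question, context=context, answer=answer
            ),
        }],
        temperature=0.7,
    )
    return response.choices[0].message.content.strip()
```

Pairing each perturbed answer with its original, faithful counterpart yields the balanced hallucinated/faithful examples described above.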
**Methods:**
The paper defines hallucination in the RAG setting, describes how the training and evaluation data are constructed from existing QA datasets and synthetic perturbations, details the fine-tuning of LYNX, and reports experimental results across real-world domains.
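
As a rough illustration of reference-free, LLM-as-judge hallucination detection, the sketch below scores a (question, document, answer) triple as faithful or hallucinated. The model identifier, prompt wording, and PASS/FAIL parsing are assumptions for illustration; the released evaluation code defines the actual prompt and output format.

```python
# Minimal sketch of reference-free hallucination detection with an LLM judge.
# The model name, prompt wording, and PASS/FAIL parsing are illustrative
# assumptions, not the exact released configuration.
from transformers import pipeline

JUDGE_PROMPT = """You are evaluating whether an answer is faithful to the
given document. Answer FAIL if the answer contains information that is not
supported by or contradicts the document, otherwise answer PASS.

Question: {question}
Document: {document}
Answer: {answer}
Verdict (PASS or FAIL):"""

judge = pipeline(
    "text-generation",
    model="PatronusAI/Llama-3-Patronus-Lynx-8B-Instruct",  # assumed identifier
    device_map="auto",
)


def is_faithful(question: str, document: str, answer: str) -> bool:
    """Return True if the judge labels the answer as faithful (PASS)."""
    prompt = JUDGE_PROMPT.format(
        question=question, document=document, answer=answer
    )
    output = judge(prompt, max_new_tokens=8, do_sample=False)[0]["generated_text"]
    verdict = output[len(prompt):].strip().upper()
    return verdict.startswith("PASS")
```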
**Results:**
LYNX achieves the highest accuracy on HaluBench, outperforming GPT-4o and other models. The 70B version of LYNX shows a significant improvement over GPT-4o, with an average accuracy increase of 7.8%.
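
HaluBench results are reported as accuracy on each source task and averaged across tasks. A minimal sketch of that aggregation, assuming per-example records with a source task, a gold label, and a judge prediction (the record layout is an assumption):

```python
# Sketch of HaluBench-style scoring: accuracy per source task plus the
# macro average across tasks. The record layout is an assumption.
from collections import defaultdict


def score(records: list[dict]) -> dict[str, float]:
    """records: [{"source": str, "label": "PASS"|"FAIL", "pred": "PASS"|"FAIL"}]"""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["source"]] += 1
        correct[r["source"]] += int(r["pred"] == r["label"])
    per_task = {s: correct[s] / total[s] for s in total}
    macro_avg = sum(per_task.values()) / len(per_task)
    per_task["average"] = macro_avg
    return per_task
```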
**Conclusion:**
LYNX provides a reference-free metric for automated RAG evaluation, addressing the critical need for safe deployment of RAG systems. The authors release all models, datasets, and experiment results for public access.
**Limitations and Future Work:**
The paper discusses limitations of the approach and outlines directions for future work.