6 May 2024 | Zhiying Zhu, Yiming Yang, Zhiqing Sun
**HaluEval-Wild: Evaluating Hallucinations of Language Models in the Wild**
**Authors:** Zhiying Zhu, Yiming Yang, Zhiqing Sun
**Abstract:**
Hallucinations pose a significant challenge to the reliability of large language models (LLMs) in critical domains. Existing benchmarks designed to assess LLM hallucinations within conventional NLP tasks are insufficient for capturing the complexities of user-LLM interactions in real-world settings. To address this gap, we introduce HaluEval-Wild, the first benchmark specifically designed to evaluate LLM hallucinations in real-world scenarios. We collect challenging user queries from existing datasets, including ShareGPT, and categorize them into five distinct types to enable a fine-grained analysis of hallucinations. Our benchmark offers a novel approach to understanding and improving LLM reliability in real-world interactions.
**Introduction:**
LLMs are prone to generating hallucinations—text that is coherent but factually incorrect or unverifiable. This has raised concerns about their reliability in critical domains such as journalism and legal documentation. Past benchmarks have primarily focused on traditional NLP tasks, but none have thoroughly evaluated hallucinations in real-world scenarios. HaluEval-Wild addresses this gap by curating challenging queries from real-world interactions and evaluating various LLMs on these queries.
**Construction of HaluEval-Wild:**
We start with the ShareGPT dataset, containing over 100,000 dialogues, and filter it to retain challenging queries that significantly test a model's knowledge and reasoning capabilities. We use a fine-tuned Llama 2-7B model to pre-screen these queries and identify those prone to inducing hallucinations. We then manually verify and categorize the queries into five types: Out-of-Scope Information (OoS), Complex Reasoning (CR), Inappropriate Content (IC), Beyond-Modality Interaction (BM), and Confused/Erroneous Queries (CE).
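To make the pre-screening step concrete, here is a minimal sketch, assuming the fine-tuned Llama 2-7B classifier has been exported as a Hugging Face text-classification checkpoint; the checkpoint path, label names, and confidence threshold are hypothetical placeholders, not details from the paper.

```python
# Minimal sketch of pre-screening ShareGPT queries with a fine-tuned classifier.
# Checkpoint path, labels, and threshold are assumptions for illustration only.
from transformers import pipeline

CATEGORIES = [
    "out-of-scope information",     # OoS
    "complex reasoning",            # CR
    "inappropriate content",        # IC
    "beyond-modality interaction",  # BM
    "confused/erroneous query",     # CE
]

classifier = pipeline(
    "text-classification",
    model="path/to/finetuned-llama2-7b-query-classifier",  # hypothetical path
)

def pre_screen(queries, threshold=0.5):
    """Keep queries the classifier flags as hallucination-prone,
    grouped by predicted category (manual verification follows)."""
    buckets = {c: [] for c in CATEGORIES}
    for q in queries:
        pred = classifier(q, truncation=True)[0]  # {'label': ..., 'score': ...}
        if pred["label"] in buckets and pred["score"] >= threshold:
            buckets[pred["label"]].append(q)
    return buckets
```

In practice the classifier only narrows the candidate pool; the paper's final categorization relies on manual verification of each retained query.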
**Evaluation:**
We evaluate various LLMs on HaluEval-Wild, including both open-source and closed-source models. Our results show significant variance in hallucination rates among different models. For example, Alpaca 7B has a hallucination rate of 99.20%, while GPT-4 Turbo has a rate of 18.64%. We also compare HaluEval-Wild with other established benchmarks, highlighting that knowledge-distilled models exhibit higher hallucination rates.
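As a rough illustration of how per-model (and per-category) hallucination rates of this kind can be tallied, the sketch below assumes each model response has already been judged as hallucinated or not by an external judge; the record fields and judging setup are assumptions, not the paper's exact protocol.

```python
# Minimal sketch: aggregate judged responses into hallucination rates (%).
# Field names ("model", "category", "hallucinated") are assumed for illustration.
from collections import defaultdict

def hallucination_rates(records):
    """records: iterable of dicts like
    {"model": "gpt-4-turbo", "category": "CR", "hallucinated": True}
    Returns {model: {category_or_'all': rate_in_percent}}."""
    totals = defaultdict(lambda: defaultdict(int))
    halluc = defaultdict(lambda: defaultdict(int))
    for r in records:
        for key in ("all", r["category"]):
            totals[r["model"]][key] += 1
            if r["hallucinated"]:
                halluc[r["model"]][key] += 1
    return {
        model: {cat: 100.0 * halluc[model][cat] / n for cat, n in cats.items()}
        for model, cats in totals.items()
    }
```

A breakdown like this is what allows the per-model comparisons quoted above (e.g., Alpaca 7B near-total hallucination versus GPT-4 Turbo's much lower rate) as well as per-category analysis.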
**Conclusion:**
HaluEval-Wild provides a comprehensive benchmark for evaluating LLM hallucinations in real-world scenarios. Our findings highlight the nuanced challenge of balancing model performance with reliability, particularly in knowledge-distilled models. HaluEval-Wild advances our understanding of LLM reliability and sets a foundation for future research aimed at enhancing the factual integrity of these models. However, the benchmark has limitations, including the need for continuous updates to remain relevant and effective.