6 May 2024 | Zhiying Zhu, Yiming Yang, Zhiqing Sun
HaluEval-Wild is a new benchmark designed to evaluate hallucinations in large language models (LLMs) in real-world scenarios. The benchmark is built from challenging user queries collected from real-world interactions between users and LLMs, specifically from the ShareGPT dataset. These queries are filtered to ensure they challenge the model's knowledge and reasoning capabilities. The queries are categorized into five types: Out-of-Scope Information (OoS), Complex Reasoning (CR), Inappropriate Content (IC), Beyond-Modality Interaction (BM), and Confused / Erroneous Queries (CE). Reference answers for these queries are generated using GPT-4 and retrieval-augmented generation (RAG) to ensure accuracy.
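To make the categorization step concrete, here is a minimal sketch of LLM-based query classification, assuming the OpenAI Python client (v1.x); the prompt wording and helper name are illustrative and not the authors' exact pipeline, though the category labels follow the paper.

```python
# Minimal sketch: classify a user query into one of the five HaluEval-Wild
# categories with an LLM. Prompt wording is illustrative, not the paper's.
from openai import OpenAI

CATEGORIES = [
    "Out-of-Scope Information (OoS)",
    "Complex Reasoning (CR)",
    "Inappropriate Content (IC)",
    "Beyond-Modality Interaction (BM)",
    "Confused / Erroneous Queries (CE)",
]

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def categorize_query(query: str) -> str:
    """Ask the model to assign exactly one of the five categories."""
    prompt = (
        "Classify the following user query into exactly one category from this list:\n"
        + "\n".join(f"- {c}" for c in CATEGORIES)
        + f"\n\nQuery: {query}\nAnswer with the category name only."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()


print(categorize_query("What did the president announce in yesterday's speech?"))
```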
The benchmark evaluates a range of LLMs, including open-source models such as Alpaca, Vicuna, Llama 2, Mistral, and Mixtral, as well as closed-source models such as GPT-3.5 and GPT-4. The results show that Alpaca 7B hallucinates on a large share of the queries, whereas GPT-4 Turbo has a markedly lower hallucination rate, indicating better reliability. Knowledge-distilled models such as Vicuna-13B are more prone to hallucination, highlighting the difficulty of balancing instruction-following performance with reliability.
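The sketch below shows how such a hallucination rate could be scored with an LLM-as-judge that compares each model response against the GPT-4/RAG reference answer; the judge prompt and helper names are assumptions for illustration, not the paper's exact evaluation protocol.

```python
# Simplified sketch of hallucination-rate scoring with an LLM judge.
# The judge prompt and function names are illustrative assumptions.
from openai import OpenAI

client = OpenAI()


def is_hallucinated(query: str, response: str, reference: str) -> bool:
    """Ask a judge model whether the response contradicts or goes beyond the reference."""
    prompt = (
        f"Query: {query}\nReference answer: {reference}\nModel response: {response}\n"
        "Does the model response contain hallucinated (unsupported or contradicted) "
        "information relative to the reference? Answer YES or NO."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")


def hallucination_rate(examples: list[dict]) -> float:
    """examples: [{'query': ..., 'response': ..., 'reference': ...}, ...]"""
    flags = [is_hallucinated(e["query"], e["response"], e["reference"]) for e in examples]
    return sum(flags) / len(flags)
```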
The benchmark also explores methods to mitigate hallucinations, such as self-reflection, in which the model critiques its own prior answer and uses that textual feedback to produce a revised response. The results show that self-reflection reduces hallucination rates, especially when additional hints are provided.
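A rough sketch of such a self-reflection loop is given below, again assuming the OpenAI Python client; the prompt wording and the optional `hint` argument are illustrative assumptions rather than the paper's exact setup.

```python
# Rough sketch of a draft -> critique -> revise self-reflection loop.
# Prompts and the optional `hint` argument are illustrative assumptions.
from openai import OpenAI

client = OpenAI()


def chat(prompt: str, model: str = "gpt-3.5-turbo") -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()


def answer_with_reflection(query: str, hint: str | None = None) -> str:
    """Draft an answer, self-critique it, then revise using the critique (and an optional hint)."""
    draft = chat(query)
    critique = chat(
        f"Question: {query}\nDraft answer: {draft}\n"
        "List any factual errors or unsupported claims in the draft."
    )
    hint_text = f"\nHint: {hint}" if hint else ""
    revised = chat(
        f"Question: {query}\nDraft answer: {draft}\nCritique: {critique}{hint_text}\n"
        "Write an improved answer that fixes the issues above; if the question cannot "
        "be answered reliably, say so."
    )
    return revised
```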
HaluEval-Wild provides a comprehensive benchmark for evaluating LLM hallucinations in real-world scenarios, offering insights into the capabilities and limitations of different models. The benchmark is available at https://github.com/Dianezzy/HaluEval-Wild. The study also acknowledges the limitations of the benchmark, including its focus on challenging queries and potential biases in categorization. Continuous updates and refinements are needed to ensure the benchmark remains relevant and effective in assessing LLM performance and reliability.