Understanding SemEval-2024 Task 6: SHROOM, a Shared-task on Hallucinations and Related Observable Overgeneration Mistakes

This paper presents the results of the SHROOM shared task, which focuses on detecting hallucinations in natural language generation (NLG) outputs. Hallucinations are outputs that are fluent but inaccurate, posing a significant risk to NLG applications where correctness is critical. The task was conducted using a dataset of 4000 model outputs from three NLP tasks: machine translation, paraphrase generation, and definition modeling, labeled by 5 annotators each. Over 58 participants from 42 teams submitted predictions on both model-aware and model-agnostic tracks. Key trends include reliance on a few models, synthetic data for fine-tuning, and zero-shot prompting strategies. While most teams outperformed the baseline system, top-scoring systems still struggled with challenging items, suggesting a need for more effective approaches. The study highlights the complexity of hallucination detection and the importance of further research to address current limitations.

SemEval-2024 Task 6: SHROOM, a Shared-task on Hallucinations and Related Observable Overgeneration Mistakes

29 Mar 2024 | Timothee Mickus, Elaine Zosa, Raúl Vázquez, Teemu Vahtola, Jörg Tiedemann, Vincent Segonne, Alessandro Raganato, Marianna Apidianaki