SemEval-2024 Task 6: SHROOM, a Shared-task on Hallucinations and Related Observable Overgeneration Mistakes

2024 | Timothee Mickus, Elaine Zosa, Raúl Vázquez, Teemu Vahtola, Jörg Tiedemann, Vincent Segonne, Alessandro Raganato, Marianna Apidianaki
The SHROOM shared task aimed to detect hallucinations in natural language generation (NLG) systems, i.e., outputs that are fluent but incorrect. The task attracted 58 participants grouped into 42 teams, 27 of which submitted system description papers. The dataset comprised 4000 model outputs, each labeled by 5 annotators, covering three NLG tasks: machine translation (MT), paraphrase generation (PG), and definition modeling (DM). Participants were asked to classify outputs as hallucinations or not in two tracks: model-aware (with access to the model that produced the output) and model-agnostic (without such access).

The shared task revealed that while many systems outperformed the baseline, even top-performing systems struggled with challenging cases, suggesting that hallucination is a graded phenomenon rather than a binary one. Systems were evaluated on accuracy and on how well their confidence estimates were calibrated against annotator judgments, with top systems achieving around 71% accuracy on the model-agnostic track and 64% on the model-aware track. Participants used a variety of methods, including fine-tuning, prompt engineering, and ensemble techniques. Notably, systems built on closed-source models such as GPT-3.5 and GPT-4 performed well, but relying on them was not a prerequisite for strong results: many top systems depended on fine-tuning or ensembling, indicating that high performance requires task-specific adaptation rather than off-the-shelf models.

The task also highlighted the difficulty of detecting hallucinations in ambiguous cases, where annotators often disagreed. These results underscore the need for further research on hallucination detection, especially across diverse languages and contexts. The SHROOM shared task provides a valuable dataset and framework for future studies of hallucination detection in NLG systems.
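To illustrate how an evaluation of this kind can be computed, the sketch below (not the official SHROOM scorer) scores a detector on accuracy against the annotators' majority vote and on the Spearman correlation between its predicted probabilities and the annotator-derived probability of hallucination, a common proxy for calibration. The field names (`annotator_labels`, `p_hallucination`) and the toy data are assumptions made for this example, not the official data schema.

```python
"""Minimal sketch of SHROOM-style scoring, under assumed field names."""
from scipy.stats import spearmanr


def evaluate(items, predictions):
    """items: dicts with 'annotator_labels' (5 ints in {0, 1});
    predictions: dicts with 'label' (0/1) and 'p_hallucination' (float)."""
    gold_labels, gold_probs = [], []
    for item in items:
        votes = item["annotator_labels"]
        gold_probs.append(sum(votes) / len(votes))            # empirical probability
        gold_labels.append(int(sum(votes) > len(votes) / 2))  # majority vote

    pred_labels = [p["label"] for p in predictions]
    pred_probs = [p["p_hallucination"] for p in predictions]

    accuracy = sum(int(g == p) for g, p in zip(gold_labels, pred_labels)) / len(gold_labels)
    rho, _ = spearmanr(pred_probs, gold_probs)  # rank correlation as a calibration proxy
    return {"accuracy": accuracy, "spearman_rho": rho}


if __name__ == "__main__":
    # Toy examples: a clear hallucination, a mostly correct output,
    # and a borderline case where annotators disagree.
    items = [
        {"annotator_labels": [1, 1, 1, 0, 1]},
        {"annotator_labels": [0, 0, 1, 0, 0]},
        {"annotator_labels": [1, 1, 0, 0, 1]},
    ]
    predictions = [
        {"label": 1, "p_hallucination": 0.9},
        {"label": 0, "p_hallucination": 0.2},
        {"label": 1, "p_hallucination": 0.55},
    ]
    print(evaluate(items, predictions))
```

Using the fraction of annotators who flag an output as the gold probability is what makes the graded view of hallucination operational: a system can be rewarded for assigning intermediate confidence to the very items on which annotators themselves disagree.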