SemEval-2024 Task 9: BRAINTEASER: A Novel Task Defying Common Sense

22 Apr 2024 | Yifan Jiang, Filip Ilievski, Kaixin Ma
SemEval-2024 Task 9: BRAINTEASER(S) introduces a novel task that challenges models to think laterally, defying common sense. The task builds on the BRAINTEASER benchmark, which evaluates models' ability to solve lateral thinking puzzles in a zero-shot setting, and is split into two subtasks: Sentence Puzzles (SP) and Word Puzzles (WP). Participants were encouraged to submit solutions; the competition received 483 submissions from 182 teams, and the results were analyzed to assess model performance on both subtasks.

The BRAINTEASER(S) dataset was constructed through a three-stage pipeline: data collection, filtering, and transformation into multiple-choice questions. It was further divided into train, trial, and test sets to support both fine-tuning and zero/few-shot settings. Systems were evaluated with accuracy metrics, and the best-performing models achieved high accuracy on both subtasks. The analysis nevertheless revealed that models struggled to reason consistently across different question formats, such as semantic and context reconstructions of the same puzzle.

The competition results highlighted the challenges of lateral thinking tasks, including the need for models to adapt to new question formats and to avoid overfitting. Fine-tuned models performed well on sentence puzzles but had difficulty with word puzzles, while prompting-based methods showed promise on both subtasks. The study emphasizes the importance of developing models that can reason laterally and adapt to new contexts, as well as the need for further research in this area. The results and analysis of the competition provide insights into the current state of lateral thinking research and the potential for future advancements in this field.
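The consistency issue mentioned above can be made concrete with a small scoring sketch. The Python snippet below is a minimal illustration, not the official task scorer: the field names, the example puzzle, and the grouping convention are assumptions made for illustration. It contrasts instance-based accuracy (each question scored independently) with group-based accuracy, which credits a puzzle only when the model also answers its semantic and context reconstructions correctly.

```python
# Minimal sketch (not the official scorer) of BRAINTEASER-style scoring.
# Field names, the example puzzle, and the grouping convention are
# assumptions for illustration only.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PuzzleInstance:
    group_id: str       # shared by an original puzzle and its reconstructions
    variant: str        # "original", "semantic", or "context"
    question: str
    choices: List[str]  # candidate answers, including a "None of the above" option
    answer_index: int   # index of the gold choice


def instance_accuracy(instances: List[PuzzleInstance],
                      predictions: Dict[str, int]) -> float:
    """Fraction of individual questions answered correctly."""
    correct = sum(
        predictions[f"{p.group_id}:{p.variant}"] == p.answer_index
        for p in instances
    )
    return correct / len(instances)


def group_accuracy(instances: List[PuzzleInstance],
                   predictions: Dict[str, int]) -> float:
    """Fraction of puzzle groups where *every* variant is answered correctly,
    i.e. the model reasons consistently across reconstructions."""
    groups: Dict[str, List[bool]] = {}
    for p in instances:
        hit = predictions[f"{p.group_id}:{p.variant}"] == p.answer_index
        groups.setdefault(p.group_id, []).append(hit)
    return sum(all(hits) for hits in groups.values()) / len(groups)


if __name__ == "__main__":
    # Hypothetical two-variant group, used only to exercise the functions.
    data = [
        PuzzleInstance("sp-001", "original",
                       "A man shaves several times a day, yet still has a beard. Why?",
                       ["He is a barber", "He uses a blunt razor",
                        "He shaves his arms", "None of the above"], 0),
        PuzzleInstance("sp-001", "semantic",
                       "Someone shaves many times daily but keeps his beard. How?",
                       ["He shaves other people", "He only pretends to shave",
                        "He shaves at night", "None of the above"], 0),
    ]
    preds = {"sp-001:original": 0, "sp-001:semantic": 2}
    print(instance_accuracy(data, preds))  # 0.5
    print(group_accuracy(data, preds))     # 0.0 -- inconsistent across variants
```

Scoring at the group level in this way penalizes models that get the original puzzle right through memorization or surface cues but fail its reconstructions, which is precisely the gap the task's analysis highlights.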