22 Apr 2024 | Yifan Jiang, Filip Ilievski, Kaixin Ma
The paper introduces SemEval Task 9: BRAINTEASER(S), a novel task designed to evaluate systems' lateral thinking abilities. Lateral thinking, which defies common sense and requires unconventional reasoning, has been challenging for current models but has received limited attention. The original BRAINTEASER benchmark, introduced by Jiang et al. (2023c), focuses on zero-shot learning, while BRAINTEASER(S) supports both zero-shot and fine-tuning settings. The task consists of two subtasks: Sentence Puzzle (SP) and Word Puzzle (WP), both evaluated using multiple-choice QA. The dataset is constructed through a three-stage pipeline to ensure validity and challenge systems.
During the competition, 182 participants submitted 483 team entries, with detailed analysis provided in this paper. The results show that while top-performing systems achieve high accuracy, they still struggle with consistent lateral thinking, especially in context reconstruction. The paper discusses the differences between the original BRAINTEASER benchmark and BRAINTEASER(S), the effectiveness of different system choices, and the challenges in lateral thinking. It concludes by highlighting the need for further research to enhance lateral thinking abilities in AI models.
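The multiple-choice QA evaluation used for both subtasks can be sketched as follows. This is a minimal illustration, not the task's official scorer; the data format, field names, and the `accuracy` helper are all assumptions for the sake of the example:

```python
# Hypothetical sketch of multiple-choice QA accuracy scoring,
# as used conceptually for the SP and WP subtasks.
# Field names and scoring function are illustrative assumptions.

def accuracy(predictions, gold_labels):
    """Fraction of puzzles where the predicted choice index matches the gold answer."""
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return correct / len(gold_labels)

# Each puzzle pairs a question with candidate answers; a system picks one index.
puzzles = [
    {"question": "...", "choices": ["A", "B", "C", "D"], "answer": 2},
    {"question": "...", "choices": ["A", "B", "C", "D"], "answer": 0},
]
preds = [2, 1]  # hypothetical system outputs
gold = [p["answer"] for p in puzzles]
print(accuracy(preds, gold))  # 0.5: one of two puzzles answered correctly
```

In practice, a benchmark of this kind may also report stricter group-level scores (e.g. requiring a system to solve a puzzle and its reconstructed variants together), but the simple per-question accuracy above captures the core metric.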