SemEval-2024 Task 2: Safe Biomedical Natural Language Inference for Clinical Trials

7 Apr 2024 | Maël Jullien, Marco Valentino, André Freitas
The paper introduces SemEval-2024 Task 2: Safe Biomedical Natural Language Inference for Clinical Trials, addressing the limitations of Large Language Models (LLMs) in handling shortcut learning, factual inconsistencies, and adversarial inputs, particularly in medical contexts. The task evaluates LLMs' robustness and applicability in healthcare through a refined dataset, NLI4CT-P, which applies perturbations to clinical trial statements to challenge models with interventional and causal reasoning tasks. Over 106 participants contributed to the task, submitting over 1200 individual entries and 25 system overview papers. The evaluation framework introduces two new metrics, Consistency and Faithfulness, which assess a model's ability to maintain uniform predictions under semantics-preserving changes and to accurately capture semantic changes, respectively.
Key findings include the superior performance of generative models, the importance of additional training data, and the impact of prompting strategies. The study highlights the need for more robust and reliable systems in clinical decision-making and provides insights into the challenges and future directions in Clinical Natural Language Inference (NLI).
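The two metrics described above can be illustrated with a minimal sketch. This is a hypothetical simplification, not the task's official scoring code: it assumes Consistency is the share of semantics-preserving perturbations whose prediction matches the prediction on the original statement, and Faithfulness is the share of semantics-altering perturbations whose prediction matches the new gold label.

```python
# Hedged sketch of Consistency and Faithfulness as described above.
# Labels are assumed to be "Entailment" / "Contradiction" strings;
# the function names and data layout are illustrative, not official.

def consistency(orig_preds, perturbed_preds):
    """Fraction of semantics-PRESERVING perturbations whose prediction
    is unchanged relative to the original statement's prediction."""
    assert len(orig_preds) == len(perturbed_preds)
    same = sum(o == p for o, p in zip(orig_preds, perturbed_preds))
    return same / len(orig_preds)

def faithfulness(perturbed_preds, perturbed_gold):
    """Fraction of semantics-ALTERING perturbations where the model
    predicts the (changed) gold label, i.e. it tracks the semantic shift."""
    assert len(perturbed_preds) == len(perturbed_gold)
    correct = sum(p == g for p, g in zip(perturbed_preds, perturbed_gold))
    return correct / len(perturbed_preds)
```

A model that ignores the perturbation entirely would score high Consistency but low Faithfulness, which is why the two metrics are reported together.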