SemEval-2024 Task 2: Safe Biomedical Natural Language Inference for Clinical Trials


2024 | Maël Jullien, Marco Valentino, André Freitas
SemEval-2024 Task 2: Safe Biomedical Natural Language Inference for Clinical Trials introduces the NLI4CT-P dataset, designed to challenge large language models (LLMs) with interventional and causal reasoning tasks. The dataset includes perturbed statements to test the semantic consistency and faithfulness of NLI models. A total of 106 participants submitted over 1,200 individual submissions and 25 system overview papers. The task aims to improve the robustness and applicability of NLI models in healthcare, ensuring safer and more dependable AI assistance in clinical decision-making. The dataset, competition leaderboard, and website are publicly available.

The task is to classify the inference relation between a clinical trial report (CTR) premise and a statement as either entailment or contradiction. The dataset contains two types of instances: single instances, grounded in one CTR, and comparison instances, grounded in two. NLI4CT-P applies four interventions to the statements: paraphrasing, numerical paraphrasing, appending text, and contradiction rephrasing. These interventions enable a systematic behavioral and causal analysis of the models evaluated in the competition (an illustrative instance layout is sketched after this summary).

Evaluation relies on three metrics: Macro F1-score, Faithfulness, and Consistency. Faithfulness measures the accuracy and grounding of a system's predictions, while Consistency assesses a system's ability to produce identical outcomes for semantically equivalent inputs. The study highlights the importance of these two metrics, as they provide deeper insight into a model's interpretative and reasoning proficiency (a minimal scoring sketch also follows this summary).

The results show that generative models outperform discriminative models in F1, Faithfulness, and Consistency, and that mid-sized architectures (7B to 70B parameters) can match or exceed the performance of larger models on all three metrics while being more resource- and cost-effective. Zero-shot prompting strategies outperform few-shot prompting in F1, Faithfulness, and Consistency. Instruction tuning is a prevalent strategy, with several teams crafting datasets specifically for this purpose, and systems fine-tuned on external datasets achieve superior performance on all metrics. Incorporating perturbed data into training further enhances a model's inference ability and improves its reliability and its adherence to the truthfulness of the clinical data it processes.

The study concludes that generative models significantly outperform discriminative models, particularly in Faithfulness and Consistency; that additional training data is especially valuable given the limited size of the NLI4CT-P training set; that mid-sized architectures can match or even exceed larger models in F1, Faithfulness, and Consistency at lower resource and cost overhead; and that prompt design plays an important role in the development and evaluation of NLI systems.
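To make the task setup concrete, the following sketch shows one possible layout for the two instance types: a single instance grounded in one clinical trial report (CTR) and a comparison instance grounded in two. Field names, trial identifiers, statements, and labels are hypothetical and do not reproduce the released NLI4CT-P schema.

```python
# Illustrative shape of the two NLI4CT-P instance types (hypothetical field
# names and values; the released JSON schema may differ).

single_instance = {
    "type": "Single",
    "primary_ctr": "NCT00000000",   # placeholder trial id; the premise is one CTR section
    "section": "Eligibility",
    "statement": "Adult patients with measurable disease are eligible for the primary trial.",
    "label": "Entailment",          # or "Contradiction"
}

comparison_instance = {
    "type": "Comparison",
    "primary_ctr": "NCT00000000",   # placeholder ids; the statement compares two CTRs
    "secondary_ctr": "NCT11111111",
    "section": "Results",
    "statement": "The primary trial reports a higher response rate than the secondary trial.",
    "label": "Contradiction",
}
```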
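The Faithfulness and Consistency metrics summarized above can likewise be illustrated with a minimal sketch. The code below assumes that Consistency is computed over semantics-preserving perturbations (the gold label is unchanged) and Faithfulness over semantics-altering perturbations, restricted to statements the system originally classified correctly; the record fields and aggregation are illustrative assumptions rather than the official scoring script.

```python
# Minimal sketch of the intervention-based metrics (not the official scorer).
# Each record stores the system's prediction on the original statement, its
# prediction on the perturbed statement, and (for Faithfulness) the gold label
# of the original statement; these field names are assumptions for illustration.

def consistency(records):
    """Fraction of semantics-preserving perturbations for which the system
    predicts the same label as it did on the original statement."""
    same = [r["pred_original"] == r["pred_perturbed"] for r in records]
    return sum(same) / len(same) if same else 0.0

def faithfulness(records):
    """Fraction of semantics-altering perturbations for which the system
    changes an originally correct prediction, i.e. it tracks the altered
    semantics rather than surface cues."""
    eligible = [r for r in records if r["pred_original"] == r["gold_original"]]
    changed = [r["pred_original"] != r["pred_perturbed"] for r in eligible]
    return sum(changed) / len(changed) if changed else 0.0

# Toy usage with hypothetical predictions:
preserving = [
    {"pred_original": "Entailment", "pred_perturbed": "Entailment"},
    {"pred_original": "Contradiction", "pred_perturbed": "Entailment"},
]
altering = [
    {"pred_original": "Entailment", "pred_perturbed": "Contradiction",
     "gold_original": "Entailment"},
]
print(consistency(preserving))   # 0.5
print(faithfulness(altering))    # 1.0
```

Under this reading, a system that ignores perturbations entirely would score high on Consistency but poorly on Faithfulness, which is one reason the two metrics are reported alongside Macro F1.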