February 2, 2024 | Shuai Wang, Harrisen Scells, Shengyao Zhuang, Martin Potthast, Bevan Koopman, Guido Zuccon
This paper investigates the effectiveness of using zero-shot large language models (LLMs) for automatic screening in systematic reviews. Systematic reviews are essential for evidence-based medicine, but the screening process is resource-intensive. The study evaluates eight different LLMs and explores a calibration technique to determine whether a publication should be included in a review. The results show that instruction fine-tuning plays a crucial role in screening, and that calibration makes LLMs practical for achieving a targeted recall. Combining both with an ensemble of zero-shot models saves significant screening time compared to state-of-the-art approaches.
The study focuses on using zero-shot LLMs to automatically screen documents for systematic reviews. The approach is evaluated in two settings: uncalibrated and calibrated. In the uncalibrated setting, a document is included if the model assigns a higher probability to the token 'yes' than to 'no'. The calibrated setting introduces a hyperparameter θ as a new decision boundary over the difference between the two tokens' probabilities; θ is tuned using a small set of starting documents or previously completed systematic reviews.
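The following is a minimal sketch of the two decision rules described above, assuming access to the probabilities the model assigns to the tokens 'yes' and 'no' for an include/exclude prompt; the function and variable names are illustrative, not taken from the paper's code.

```python
def uncalibrated_decision(p_yes: float, p_no: float) -> bool:
    """Uncalibrated setting: include the document if 'yes' is the
    more probable of the two target tokens."""
    return p_yes > p_no


def calibrated_decision(p_yes: float, p_no: float, theta: float) -> bool:
    """Calibrated setting: include the document if the yes/no probability
    gap clears the decision boundary theta. Setting theta = 0 recovers
    the uncalibrated rule."""
    return (p_yes - p_no) > theta
```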
The evaluation addresses four research questions: how the architecture and size of the LLM, instruction-based fine-tuning, calibration of the classifier's decisions with respect to the target tokens' likelihoods, and ensembling of LLM-based classifiers with current strong neural baselines each influence effectiveness.
The results show that Llama2-7b-ins is currently the best model for this task, outperforming the 13b-parameter variant. Instruction-tuned models consistently outperform their base counterparts, and Llama2-based models consistently outperform the BERT-based baseline. The calibrated setting with ensembling achieves the best result overall and approaches the predefined recall target for the test topics, indicating practical use.
The study also compares the effectiveness of zero-shot LLMs against a baseline using the BioBERT architecture. The results show that zero-shot LLMs outperform the baseline in terms of recall and success rate. The calibrated approach with ensembling provides the best performance, achieving high recall and success rates. The study highlights the importance of output calibration when applying generative LLMs to systematic review document screening. Calibration ensures that the model meets pre-set recall targets, maintaining review quality and reliability. The findings suggest that LLM-based methods can be used for automatically screening documents for systematic reviews, leading to considerable savings in manual effort. These methods are also practical as they do not require expensive fine-tuning. The results indicate that these methods might be mature enough for actual adoption in systematic review workflows.
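As a rough illustration of how calibration and ensembling could fit together, the sketch below picks θ so that a target recall is met on a labeled calibration set (e.g. starting documents or a previously completed review) and averages the yes/no gaps across models before thresholding. The threshold-selection rule and function names here are assumptions for illustration; the paper's exact procedure may differ.

```python
import numpy as np


def calibrate_theta(gaps: np.ndarray, labels: np.ndarray,
                    target_recall: float = 0.95) -> float:
    """Choose the largest theta such that recall on the calibration set
    still meets the target. gaps[i] is p_yes - p_no for document i;
    labels[i] is 1 if the document was included by the reviewers."""
    included_gaps = np.sort(gaps[labels == 1])
    # Keep at least target_recall of the included documents above theta.
    cutoff_index = int(np.floor((1.0 - target_recall) * len(included_gaps)))
    return float(included_gaps[cutoff_index]) - 1e-9


def ensemble_gap(per_model_gaps: list[np.ndarray]) -> np.ndarray:
    """Simple ensembling sketch: average the yes/no probability gaps
    across models, then apply the calibrated threshold to the average."""
    return np.mean(np.stack(per_model_gaps), axis=0)
```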