Zero-shot Generative Large Language Models for Systematic Review Screening Automation

February 2, 2024 | Shuai Wang, Harrisen Scells, Shengyao Zhuang, Martin Potthast, Bevan Koopman, Guido Zuccon
This paper investigates the effectiveness of zero-shot large language models (LLMs) for automating the screening phase of systematic reviews. Systematic reviews are central to evidence-based medicine but are resource-intensive, particularly during screening. The study evaluates eight LLMs and a calibration technique that uses a predefined recall target to decide whether a publication should be included in a review. The evaluation, on five standard test collections, shows that instruction fine-tuning is important for the screening task, that calibration makes zero-shot LLMs practical for achieving a targeted recall, and that combining both with an ensemble of zero-shot models saves significant screening time compared to state-of-the-art approaches. The best individual model, Llama2-7b-ins, outperforms the other models in recall and success rate. The calibrated ensemble, which combines the top zero-shot LLMs with a BioBERT baseline, achieves the highest balanced accuracy and work saved over sampling (WSS) at a fixed recall level. The findings suggest that LLM-based methods can substantially reduce manual effort in systematic review workflows while maintaining high recall across different types of reviews.
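
To make the recall-targeted calibration idea concrete, here is a minimal sketch, assuming a held-out calibration set with inclusion labels and per-publication relevance scores from each model. The function and variable names (calibrate_threshold, ensemble_scores, target_recall) are illustrative and are not taken from the paper or its released code.

```python
# Sketch: pick a score threshold so that a target recall is met on calibration
# data, then apply it to ensemble scores. Illustrative only; not the paper's code.
import numpy as np

def calibrate_threshold(scores: np.ndarray, labels: np.ndarray,
                        target_recall: float = 0.95) -> float:
    """Return the highest score threshold whose recall on the calibration set
    still reaches the target, so as few publications as possible are kept."""
    order = np.argsort(-scores)            # rank publications from highest to lowest score
    sorted_scores = scores[order]
    sorted_labels = labels[order]
    total_pos = max(sorted_labels.sum(), 1)
    recall = np.cumsum(sorted_labels) / total_pos   # recall if we cut off after each rank
    idx = np.searchsorted(recall, target_recall)    # first cutoff that reaches the target
    idx = min(idx, len(sorted_scores) - 1)
    return float(sorted_scores[idx])

def ensemble_scores(model_scores: list[np.ndarray]) -> np.ndarray:
    """Simple ensemble: average relevance scores from several models
    (e.g., zero-shot LLMs plus a BioBERT baseline)."""
    return np.mean(np.stack(model_scores), axis=0)

# Usage: include a publication when its ensemble score meets the calibrated threshold.
# combined = ensemble_scores([llm_scores, biobert_scores])
# threshold = calibrate_threshold(calib_scores, calib_labels, target_recall=0.95)
# include = combined >= threshold
```

For reference, the work-saved-over-sampling metric at a recall level R is commonly computed as WSS@R = (TN + FN)/N - (1 - R), i.e., the fraction of publications that would not need manual screening compared with random sampling achieving the same recall.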