Performance of a Large Language Model in Screening Citations

2024 | Takehiko Oami, MD, PhD; Yohei Okada, MD, PhD; Taka-aki Nakada, MD, PhD
This study evaluates the accuracy and efficiency of large language models (LLMs) in screening citations for systematic reviews. The research team tested an LLM (GPT-4 Turbo) against conventional methods for screening titles and abstracts for 5 clinical questions (CQs) in the development of the Japanese Clinical Practice Guidelines for Management of Sepsis and Septic Shock. The LLM decided whether to include or exclude each citation based on inclusion and exclusion criteria covering patient, population, or problem; intervention; comparison; and study design.

In the primary analysis, LLM-assisted screening had a sensitivity of 0.75 (95% CI, 0.43 to 0.92) and a specificity of 0.99 (95% CI, 0.99 to 0.99). After the command prompt was modified, the integrated sensitivity improved to 0.91 (95% CI, 0.77 to 0.97) without significantly compromising specificity (0.98 [95% CI, 0.96 to 0.99]). LLM-assisted screening was also significantly faster, taking 1.3 minutes per 100 studies compared with 17.2 minutes for conventional methods. Post hoc modifications, including a majority-vote strategy and a chain-of-thought strategy, further improved the accuracy of the LLM-assisted screening.

The study concluded that LLM-assisted citation screening demonstrated acceptable sensitivity and reasonably high specificity with reduced processing time, suggesting it could enhance efficiency and reduce workload in systematic reviews. Limitations include the narrow scope of the medical setting and the potential for misclassification of the reference standard. Despite these limitations, the integration of LLMs into systematic reviews shows promise for enhancing the speed and breadth of knowledge synthesis.
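The majority-vote aggregation and the sensitivity/specificity metrics reported above can be sketched as follows. This is a minimal illustration with hypothetical screening data and vote counts; the study's actual prompts, number of repeated runs, and citation sets are not reproduced here.

```python
from collections import Counter

def majority_vote(decisions):
    """Aggregate repeated include/exclude decisions for one citation
    by taking the most common verdict (True = include)."""
    return Counter(decisions).most_common(1)[0][0]

def sensitivity_specificity(predicted, reference):
    """Compare screening decisions against a reference standard
    (True = include). Returns (sensitivity, specificity)."""
    tp = sum(p and r for p, r in zip(predicted, reference))
    fn = sum((not p) and r for p, r in zip(predicted, reference))
    tn = sum((not p) and (not r) for p, r in zip(predicted, reference))
    fp = sum(p and (not r) for p, r in zip(predicted, reference))
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical data: three LLM runs per citation, aggregated per citation.
runs = [
    [True, True, False],    # citation 1 -> include by majority
    [False, False, True],   # citation 2 -> exclude by majority
    [True, True, True],     # citation 3 -> include
    [False, False, False],  # citation 4 -> exclude
]
predicted = [majority_vote(r) for r in runs]
reference = [True, False, False, False]  # human reviewers' decisions
sens, spec = sensitivity_specificity(predicted, reference)
```

In this toy example the vote aggregation yields one include/exclude decision per citation, and the two metrics then summarize agreement with the human reference standard, mirroring how the study's accuracy figures are computed.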