Performance of a Large Language Model in Screening Citations

2024 | Takehiko Oami, MD, PhD; Yohei Okada, MD, PhD; Taka-aki Nakada, MD, PhD
This study evaluates the accuracy and efficiency of large language models (LLMs) in screening citations for systematic reviews. The research team tested an LLM (GPT-4 Turbo) against conventional methods for screening titles and abstracts for 5 clinical questions (CQs) in the development of the Japanese Clinical Practice Guidelines for Management of Sepsis and Septic Shock. The LLM decided whether to include or exclude each citation based on inclusion and exclusion criteria covering patient, population, or problem; intervention; comparison; and study design.

In the primary analysis, LLM-assisted screening had a sensitivity of 0.75 (95% CI, 0.43 to 0.92) and a specificity of 0.99 (95% CI, 0.99 to 0.99). After the command prompt was modified, the integrated sensitivity improved to 0.91 (95% CI, 0.77 to 0.97) without significantly compromising specificity (0.98 [95% CI, 0.96 to 0.99]). LLM-assisted screening was also significantly faster, taking 1.3 minutes per 100 studies compared with 17.2 minutes for conventional methods. Post hoc modifications, including a majority-vote strategy and a chain-of-thought strategy, further improved the accuracy of the LLM-assisted screening.

The study concluded that LLM-assisted citation screening demonstrated acceptable sensitivity and reasonably high specificity with reduced processing time, suggesting it could enhance efficiency and reduce workload in systematic reviews. Limitations include the narrow scope of the medical setting and the potential for misclassification of the reference standard. Despite these limitations, the integration of LLMs into systematic reviews shows promise for enhancing the speed and breadth of knowledge synthesis.
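The majority-vote aggregation and the sensitivity/specificity metrics reported above can be sketched as follows. This is a minimal illustration with hypothetical screening data and vote counts; the study's actual prompts, number of repeated runs, and citation sets are not reproduced here.

```python
from collections import Counter

def majority_vote(decisions):
    """Aggregate repeated include/exclude decisions for one citation
    by taking the most common verdict (True = include)."""
    return Counter(decisions).most_common(1)[0][0]

def sensitivity_specificity(predicted, reference):
    """Compare screening decisions against a reference standard
    (True = include). Returns (sensitivity, specificity)."""
    tp = sum(p and r for p, r in zip(predicted, reference))
    fn = sum((not p) and r for p, r in zip(predicted, reference))
    tn = sum((not p) and (not r) for p, r in zip(predicted, reference))
    fp = sum(p and (not r) for p, r in zip(predicted, reference))
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical data: three LLM runs per citation, aggregated per citation.
runs = [
    [True, True, False],    # citation 1 -> include by majority
    [False, False, True],   # citation 2 -> exclude by majority
    [True, True, True],     # citation 3 -> include
    [False, False, False],  # citation 4 -> exclude
]
predicted = [majority_vote(r) for r in runs]
reference = [True, False, False, False]  # human reviewers' decisions
sens, spec = sensitivity_specificity(predicted, reference)
```

In this toy example the vote aggregation yields one include/exclude decision per citation, and the two metrics then summarize agreement with the human reference standard, mirroring how the study's accuracy figures are computed.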