Assessing the Risk of Bias in Randomized Clinical Trials With Large Language Models


May 22, 2024 | Honghao Lai, MM; Long Ge, MD; Mingyao Sun, MSN; Bei Pan, MD; Jiajie Huang, MSN; Liangying Hou, MD; Qiuyu Yang, MD; Jiayi Liu, MM; Jianing Liu, MSN; Ziyi Ye, MM; Danni Xia, MM; Weilong Zhao, MM; Xiaomian Wang, MD; Ming Liu, MD; Jhalak Ronjan Talukdar, PhD; Jinhui Tian, MD; Kehu Yang, MD; Janne Estill, PhD
This study evaluates the feasibility and reliability of using large language models (LLMs) to assess risk of bias (ROB) in randomized clinical trials (RCTs). The researchers used two LLMs, ChatGPT (LLM 1) and Claude (LLM 2), to assess ROB in 30 RCTs selected from published systematic reviews.

A structured prompt was developed to guide the LLMs in applying a modified version of the Cochrane ROB tool. Each RCT was assessed twice by both models, and the results were compared with assessments by three expert reviewers. The accuracy, consistency, and efficiency of the LLMs were measured.
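The structured-prompt workflow described above can be pictured as follows. This is a minimal sketch, not the authors' actual prompt or pipeline: the domain list assumes the standard Cochrane ROB 1 domains (the article confirms only random sequence generation, allocation concealment, and other concerns by name), the prompt wording is illustrative, and `call_llm` is a hypothetical stand-in for whichever chat-model API is used.

```python
# Hypothetical sketch of the structured-prompt ROB workflow; the domain
# list and prompt text are assumptions, not the study's actual materials.

ROB_DOMAINS = [
    "random sequence generation",
    "allocation concealment",
    "blinding of participants and personnel",
    "blinding of outcome assessment",
    "incomplete outcome data",
    "selective reporting",
    "other concerns",
]

PROMPT_TEMPLATE = (
    "You are assessing risk of bias in a randomized clinical trial using "
    "a modified Cochrane ROB tool. For the domain '{domain}', answer with "
    "exactly one of: low risk, high risk, unclear risk, followed by a "
    "one-sentence justification quoted from the article.\n\n"
    "Article text:\n{article_text}"
)

def assess_rob(article_text: str, call_llm) -> dict[str, list[str]]:
    """Run the structured prompt twice per domain, mirroring the study's
    repeated assessments so test-retest consistency can be measured."""
    results: dict[str, list[str]] = {}
    for domain in ROB_DOMAINS:
        prompt = PROMPT_TEMPLATE.format(domain=domain, article_text=article_text)
        results[domain] = [call_llm(prompt) for _ in range(2)]
    return results
```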
Both LLMs demonstrated high accuracy, with LLM 1 achieving a mean correct assessment rate of 84.5% and LLM 2 achieving 89.5%; the risk difference of 0.05 between the two models indicates that LLM 2 performed slightly better. In most domains the correct assessment rates fell between 80% and 90%, although sensitivity was lower in the domains of random sequence generation, allocation concealment, and other concerns. Consistency between the two repeated assessments was also high, at 84.0% for LLM 1 and 87.3% for LLM 2, and both models showed high kappa values, indicating strong agreement with the expert assessments. LLM 2 was markedly faster, averaging 53 seconds per assessment compared with 77 seconds for LLM 1.

The study concludes that LLMs can serve as supportive tools in systematic review processes, offering high accuracy and consistency in assessing ROB in RCTs. However, it also identifies limitations, including potential bias arising from the different methods of submitting articles and the need for further research on assessing complex domains. Overall, the findings suggest that LLMs have the potential to enhance the efficiency and accuracy of ROB assessments in systematic reviews.
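The headline metrics reported above (correct assessment rate, consistency across repeated runs, and kappa agreement with the experts) correspond to standard agreement statistics. Below is a minimal sketch of how they could be computed once the per-domain judgments are flattened into parallel lists of labels; it uses scikit-learn, the variable names are illustrative, and this is not the study's actual analysis code.

```python
# Illustrative evaluation harness, assuming judgments are flattened into
# parallel lists of labels such as "low" / "high" / "unclear".
from sklearn.metrics import accuracy_score, cohen_kappa_score

def evaluate(llm_labels: list[str], llm_labels_rerun: list[str],
             expert_labels: list[str]) -> dict[str, float]:
    return {
        # share of domain judgments matching the expert consensus
        "correct_rate": accuracy_score(expert_labels, llm_labels),
        # chance-corrected agreement with the expert assessments
        "kappa_vs_experts": cohen_kappa_score(expert_labels, llm_labels),
        # share of identical judgments across the two repeated runs
        "consistency": accuracy_score(llm_labels, llm_labels_rerun),
    }

# Risk difference between the two models' correct assessment rates:
risk_difference = round(0.895 - 0.845, 2)  # 0.05, as reported
```

Run over all 30 trials, a harness of this kind would yield per-model summary statistics of the sort quoted above.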